Abstract

Objectives: This study will extend an established model for estimating the living wage, originally built for 2014, back over the past decade in order to:

  • perform an exploratory analysis of trends in the gap between the estimated living wage and the minimum wage
  • evaluate any correlation between the living wage gap and other economic metrics, including public funds spent on social services

Methods: The original data set for this model covers 2014. This study will extend the model's data sources into the past to enable trend analysis, and will supplement them with economic metrics from public data sources for correlation analysis.

Methods

Model

The original model estimates the living wage in terms of eight variables:

basic_needs_budget = food_cost + child_care_cost + ( insurance_premiums + health_care_costs ) + housing_cost + transportation_cost + other_necessities_cost

living_wage = basic_needs_budget + ( basic_needs_budget * tax_rate )
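
As a small illustration (a sketch only, not the authors' implementation), the model can be written as a function whose arguments mirror the variables above:

def estimate_living_wage(food_cost, child_care_cost, insurance_premiums,
                         health_care_costs, housing_cost, transportation_cost,
                         other_necessities_cost, tax_rate):
    ''' Sketch of the model: sum the annual cost components, then gross up for taxes. '''
    basic_needs_budget = (food_cost + child_care_cost
                          + (insurance_premiums + health_care_costs)
                          + housing_cost + transportation_cost
                          + other_necessities_cost)
    return basic_needs_budget + basic_needs_budget * tax_rate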

Data Sources

The following data sources are used to find estimates of the model variables:

  • The food cost is estimated from the USDA’s low-cost food plan national average for June 2014.
  • Child care is based on state-level estimates published by the National Association of Child Care Resource and Referral Agencies.
  • Insurance costs are based on the insurance component of the 2013 Medical Expenditure Panel Survey.
  • Housing costs are taken from the HUD Fair Market Rent (FMR) estimates.
  • Other variables are pulled from the 2014 Bureau of Labor Statistics Consumer Expenditure Survey.

These data sets extend into the past, allowing the model to be calculated for earlier years. The data will also have to be adjusted for inflation [6].

Analytic Approach

First, data will be gathered from the data sources of the original model, extended into the past. The methodology of the model will be replicated to produce a data set of living wage estimates across time. After the data set is prepared, the trend of the living wage compared to the minimum wage can be examined. Has the gap increased or decreased over time, and at what rate? Have certain areas seen larger-than-average increases or decreases in this gap?

Once preliminary trend analysis is done, this data set will be compared against other economic trends to see if any interesting correlations can be found. Correlations with the GDP growth rate and the national unemployment rate can be made, but the primary investigation will be whether the living wage gap correlates with national spending on SNAP (food stamps). In other words, we will see if there is any (potentially time-lagged) relationship between the living wage gap and how much the United States needs to spend to support those who cannot make ends meet. A relationship here could indicate that shrinking this gap might lower public expenditures.
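
As a sketch of the planned lag analysis (assuming two hypothetical pandas Series, living_wage_gap and snap_spending, both indexed by year; illustrative only, not the final implementation):

import pandas as pd

def lagged_correlations(living_wage_gap, snap_spending, max_lag=3):
    ''' Pearson correlation between the living wage gap in year t and
        SNAP spending in year t + lag, for lag = 0..max_lag. '''
    results = {}
    for lag in range(max_lag + 1):
        # Shift spending back by `lag` years so gap(t) lines up with spending(t + lag)
        results[lag] = living_wage_gap.corr(snap_spending.shift(-lag))
    return pd.Series(results, name="pearson_r")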

Presentation Of Results

Results will be presented for both parts of the data analysis. For the living wage gap trends, this report will present time series graphs, aggregated in different ways, of the living wage as well as the living wage gap. Some of these time series will be presented alongside data on public expenditures on SNAP to visually inspect for correlations.

Background / Sources

  • Glasmeier AK, Nadeau CA, Schultheis E: Living Wage Calculator User’s Guide / Technical Notes, 2014 Update
  • USDA Low-Cost Food Plan, June 2014
  • Child Care in America: 2014 State Fact Sheets
  • 2013 Medical Expenditure Panel Survey
  • Bureau of Labor Statistics Consumer Expenditure Survey
  • Inflation Calculator



Pre-Data Collection

(TOC)

Let's do all of our imports now:


In [1]:
%pylab inline

# Standard stuff
import os
import itertools
from collections import OrderedDict, defaultdict
import pickle

# Data science!
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
import statsmodels.api as sm
import weighted

# For crawling insurance data
from bs4 import BeautifulSoup

# Useful for display purposes
from prettytable import PrettyTable
# from IPython.core.display import HTML
from IPython.display import display, HTML
from pprint import pprint

# Used for determining string similarity for county names
# import editdist


# Path to local dir on my laptop
PROJECT_PATH = "/home/james/Development/Masters/Thesis"

def constant_factory(value):
    ''' Always produces a constant value; used for defaultdict '''
    return itertools.repeat(value).next

def caption(msg, tablenum):
    ''' Help convert text into suitable table caption '''
    return "<br><b>Table %d - %s</b>" % (tablenum, msg)


Populating the interactive namespace from numpy and matplotlib
/home/james/Development/Masters/masters/local/lib/python2.7/site-packages/matplotlib/__init__.py:872: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter.
  warnings.warn(self.msg_depr % (key, alt_key))

Let's set up some inflation multipliers:


In [2]:
# Multiply a dollar value to get the equivalent 2014 dollars
# Original numbers from model; used to confirm methodology matches the original model
inflation_multipliers = {
    2010: 1.092609, 
    2011: 1.059176,
    2012: 1.037701,
    2013: 1.022721,
    2014: 1.0
}

# Updated inflation numbers should scale to 2015 dollars
updated_inflation_multipliers = {
    2000: 1.3811731,
    2001: 1.3429588,
    2002: 1.3220567,
    2003: 1.2925978,
    2004: 1.2590683,
    2005: 1.2178085,
    2006: 1.179752,
    2007: 1.1470807,
    2008: 1.1046664,
    2009: 1.1086106, 
    2010: 1.0907198, 
    2011: 1.0573444,
    2012: 1.0359069,
    2013: 1.0209524,
    2014: 1.004655
}
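
For example, a nominal dollar amount observed in 2005 can be expressed in the base-year dollars used above by multiplying by the 2005 entry (hypothetical $500 monthly cost, for illustration):

nominal_2005 = 500.0
adjusted = nominal_2005 * updated_inflation_multipliers[2005]   # ~608.90 in base-year dollars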

Global identifiers used throughout the project:


In [3]:
# Constants used to refer to US regions
REGION_EAST = 'east'
REGION_MIDWEST = 'midwest'
REGION_SOUTH = 'south'
REGION_WEST = 'west'
REGION_BASE = 'base'   # Used for when a state is not in a region (Alaska, Hawaii mostly)

# Create an initial state to region mapping to use for regional weighting
state_to_region_mapping = defaultdict(constant_factory(REGION_BASE))
    
state_to_region_mapping.update(
    { 
    'PA': REGION_EAST, 'NJ': REGION_EAST, 'NY': REGION_EAST, 'CT': REGION_EAST, 'MA': REGION_EAST,
    'NH': REGION_EAST, 'VT': REGION_EAST, 'ME': REGION_EAST, 'RI': REGION_EAST, 
    'OH': REGION_MIDWEST, 'IL': REGION_MIDWEST, 'IN': REGION_MIDWEST, 'WI': REGION_MIDWEST, 'MI': REGION_MIDWEST,
    'MN': REGION_MIDWEST, 'IA': REGION_MIDWEST, 'MO': REGION_MIDWEST, 'KS': REGION_MIDWEST, 'NE': REGION_MIDWEST,
    'SD': REGION_MIDWEST, 'ND': REGION_MIDWEST,
    'TX': REGION_SOUTH, 'OK': REGION_SOUTH, 'AR': REGION_SOUTH, 'LA': REGION_SOUTH, 'MS': REGION_SOUTH,
    'AL': REGION_SOUTH, 'GA': REGION_SOUTH, 'FL': REGION_SOUTH, 'SC': REGION_SOUTH, 'NC': REGION_SOUTH,
    'VA': REGION_SOUTH, 'WV': REGION_SOUTH, 'KY': REGION_SOUTH, 'TN': REGION_SOUTH, 'MD': REGION_SOUTH,
    'DE': REGION_SOUTH,
    'CA': REGION_WEST, 'OR': REGION_WEST, 'WA': REGION_WEST, 'NV': REGION_WEST, 'ID': REGION_WEST,
    'UT': REGION_WEST, 'AZ': REGION_WEST, 'MT': REGION_WEST, 'WY': REGION_WEST, 'CO': REGION_WEST,
    'NM': REGION_WEST, 'AK': REGION_BASE, 'HI': REGION_BASE
})

# Create the inverse mapping of region to list of states
region_to_state_mapping = { }
for state, region in state_to_region_mapping.iteritems():
    if region in region_to_state_mapping:
        region_to_state_mapping[region].append(state)
    else:
        region_to_state_mapping[region] = [state]

# State to full statename mapping
state_to_statename_mapping = {u'Federal': 'federal', u'AK': u'alaska', u'AL': u'alabama', u'AR': u'arkansas', u'AZ': u'arizona', 
  u'CA': u'california', u'CO': u'colorado', u'CT': u'connecticut', u'DC': u'district of columbia', u'DE': 
  u'delaware', u'FL': u'florida', u'GA': u'georgia', u'HI': u'hawaii', u'IA': u'iowa', u'ID': u'idaho', 
  u'IL': u'illinois', u'IN': u'indiana', u'KS': u'kansas', u'KY': u'kentucky', u'LA': u'louisiana', 
  u'MA': u'massachusetts', u'MD': u'maryland', u'ME': u'maine', u'MI': u'michigan', u'MN': u'minnesota', 
  u'MO': u'missouri', u'MS': u'mississippi', u'MT': u'montana', u'NC': u'north carolina', u'ND': u'north dakota', 
  u'NE': u'nebraska', u'NH': u'new hampshire', u'NJ': u'new jersey', u'NM': u'new mexico', u'NV': u'nevada', 
  u'NY': u'new york', u'OH': u'ohio', u'OK': u'oklahoma', u'OR': u'oregon', u'PA': u'pennsylvania', 
  u'RI': u'rhode island', u'SC': u'south carolina', u'SD': u'south dakota', u'TN': u'tennessee', u'TX': u'texas', 
  u'UT': u'utah', u'VA': u'virginia', u'VT': u'vermont', u'WA': u'washington', u'WI': u'wisconsin', 
  u'WV': u'west virginia', u'WY': u'wyoming'}

statename_to_state_mapping = {statename:state for (state, statename) in state_to_statename_mapping.iteritems()}

Let's set up regional differences for the food data:


In [4]:
# Multiply price of food by regional multiplier to get better estimate of food costs
food_regional_multipliers = {
    REGION_EAST: 0.08,
    REGION_WEST: 0.11,
    REGION_SOUTH: -0.07,
    REGION_MIDWEST: -0.05,
}

Create the list of years we will work with in the model:


In [5]:
model_years = range(2004, 2015)

Some useful FIPS codes:


In [6]:
# FIPS codes for the five NYC counties
newyork_county = 3606199999
kings_county = 3604799999
queens_county = 3608199999
bronx_county = 3600599999
richmond_county = 3608599999
nyc_counties = [newyork_county, kings_county, queens_county, bronx_county, richmond_county]

# FIPS code for Trenton, NJ
mercer_county = 3402199999

# Fips code for Orange county
orange_county = 601399999

Global knobs for visuals


In [7]:
global_fontweight = 'bold'
global_fontsize = 20
# global_theme = "whitegrid"
global_theme= "darkgrid"
global_theme_options = {"axes.facecolor": ".9"}

current_color_palette = sns.color_palette()
husl_color_palette = sns.color_palette("husl", len(model_years))

def setup_custom_visuals():
    sns.set_style(global_theme, global_theme_options)

    font = {'weight' : global_fontweight,
            'size'   : global_fontsize}
    tick = {
        'labelsize' : 12  # fontsize of the x and y tick labels
    }

    rc('font', **font)  # pass in the font dict as kwargs
    rc('xtick', **tick)
    rc('ytick', **tick)
    
def setup_default_visuals():
    rcdefaults()
    sns.set_style(global_theme, global_theme_options)

setup_default_visuals()
setup_custom_visuals()


/home/james/Development/Masters/masters/local/lib/python2.7/site-packages/matplotlib/__init__.py:892: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter.
  warnings.warn(self.msg_depr % (key, alt_key))



Data Collection

(TOC)

The following sections outline how I gathered the data for the various model parameters, as well as other data we need to calculate their values. The original model was built on 2014 data, so extending it into the past means we need to be careful to note any changes in the underlying methodology of these data sources.

Data Sources

Consumer Expenditure Report

Wget commands used to get the Consumer Expenditure Reports:


In [ ]:
# Get CEX for 2013 and 2014 (XLSX format)
for i in `seq 2013 2014`; do wget http://www.bls.gov/cex/$i/aggregate/cusize.xlsx -O ${i}_cex.xlsx; done

# Get CEX for 2004 - 2012 (XLS format)
for i in `seq 2004 2012`; do wget http://www.bls.gov/cex/$i/aggregate/cusize.xls -O ${i}_cex.xls; done

# Get CEX for 2001 to 2003 (TXT format)
for i in `seq 2001 2003`; do wget http://www.bls.gov/cex/aggregate/$i/cusize.txt -O ${i}_cex.txt; done


# Get CEX region for 2013 and 2014 (XLSX format)
for i in `seq 2013 2014`; do wget http://www.bls.gov/cex/$i/aggregate/region.xlsx -O ${i}_region_cex.xlsx; done

# Get CEX region for 2004 - 2012 (XLS format)
for i in `seq 2004 2012`; do wget http://www.bls.gov/cex/$i/aggregate/region.xls -O ${i}_region_cex.xls; done

# Get CEX region for 2001 to 2003 (TXT format)
for i in `seq 2001 2003`; do wget http://www.bls.gov/cex/aggregate/$i/region.txt -O ${i}_region_cex.txt; done

USDA Food Plans

Wget commands used to gather data files:


In [ ]:
# Change command to get '10 - '15
for i in {1..9}; do  wget http://www.cnpp.usda.gov/sites/default/files/usda_food_plans_cost_of_food/CostofFoodJun0$i.pdf; done

Fair Market Rent Data From HUD

Below are the wget commands for getting the FMR data:


In [ ]:
cd data/fmr
for i in `seq 2014 2015`; do wget http://www.huduser.gov/portal/datasets/fmr/fmr${i}f/FY${i}_4050_RevFinal.xls -O fmr${i}.xlsx; done
for i in `seq 2010 2013`; do wget http://www.huduser.gov/portal/datasets/fmr/fmr${i}f/FY${i}_4050_Final.xls -O fmr${i}.xlsx; done
for i in `seq 2009 2009`; do wget http://www.huduser.gov/portal/datasets/fmr/fmr${i}r/FY${i}_4050_Rev_Final.xls -O fmr${i}.xlsx; done

# Earlier years use inconsistent file names, so download them individually
wget http://www.huduser.gov/portal/datasets/fmr/fmr2008r/FMR_county_fy2008r_rdds.xls
wget http://www.huduser.gov/portal/datasets/fmr/fmr2007f/FY2007F_County_Town.xls
wget http://www.huduser.gov/portal/datasets/fmr/fmr2006r/FY2006_County_Town.xls
wget http://www.huduser.gov/portal/datasets/fmr/fmr2005r/Revised_FY2005_CntLevel.xls
wget http://www.huduser.gov/portal/datasets/FMR/FMR2004F/FMR2004F_County.xls
wget http://www.huduser.gov/portal/datasets/fmr/FMR2003F_County.xls
wget http://www.huduser.gov/portal/datasets/fmr/FMR2002F.xls

In [23]:
# Counties dict will map county ID to useful information, mostly region and state
counties = { }

Medical Expenditure Panel Survey from the AHRQ

Below are the wget commands used to download this data. This data will have to be further parsed from HTML.


In [ ]:
# Load insurance data
cd data/insurance
for i in `seq 2001 2014`; do 
    wget -O ${i}_tiic2.html http://meps.ahrq.gov/mepsweb/data_stats/summ_tables/insr/state/series_2/${i}/tiic2.htm ;
done

Tax Data

Here are all the files we need for tax data:


In [ ]:
# Data from Tax Foundation on individual tax rates per state per year
cd data/taxes
wget -O State_Individual_Income_Tax_Rates_2000-2014.xlsx http://taxfoundation.org/sites/taxfoundation.org/files/docs/State%20Individual%20Income%20Tax%20Rates%2C%202000-2014.xlsx

Median Wage Data


In [ ]:
# Download data from census website
cd data/census/SAIPE
for i in `seq 10 14`; do wget http://www.census.gov/did/www/saipe/downloads/estmod${i}/est${i}ALL.xls ; done
for i in `seq 4 9`; do wget http://www.census.gov/did/www/saipe/downloads/estmod0${i}/est0${i}ALL.xls ; done

Minimum Wage Data

TODO: Fill in


In [ ]:




Model Variables

(TOC)

Housing Costs

Definition from the model:

We assumed that a one adult family would rent a single occupancy unit (zero bedrooms) for an individual adult household, that a two adult family would rent a one bedroom apartment,

The counties are identified by the FIPS code, which is just state code + county code + subcounty code (the subcounty part only appears post-2005).
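
For example, New York County (state code 36, county code 061) produces the padded FIPS value used for the constants defined earlier (an illustration only):

state_code, county_code = 36, 61
# state code + zero-padded county code + '99999' (wildcard subcounty identifier)
fips = int('%d%03d99999' % (state_code, county_code))
assert fips == newyork_county   # 3606199999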

We need to do some string matching to find FIPS codes for 2002, since they are not in the file. Exact matches work for 84% of the data; the rest is filled in by finding the name with the smallest Levenshtein distance. I used py-editdist instead of NLTK's implementation due to speed issues.

Final data can be found in the Appendix: Housing Costs Data Table.

Methodology Confidence

TODO

TODO

  • Fix 2002 issue
  • Look into 2005 and 2006 transition

In [12]:
# Fair Market Rent data
fmr_data = { }

def pad_county(county):
    ''' Pad counties to three digits when we need to construct one manually. '''
    return '%03d' % county

def pad_fips(fip):
    ''' Add 99999 to end of fip code (which nullifies the subcounty identifier) '''
    return int(str(fip) + '99999')

In [13]:
# For now, loading 2002 - 2014
for year in range(2002, 2015):
    with open(PROJECT_PATH + "/data/fmr/fmr%d.csv" % year, 'rb') as csvfile:
        # Store dataframe from csv into dict
        fmr_data[year] = pd.read_csv(csvfile)
        
        # Lower case headings to make life easier
        fmr_data[year].columns = map(str.lower, fmr_data[year].columns)
        
        # Custom processing per year
        if year > 2012:
            fmr_data[year] = fmr_data[year][["fmr1", "county", "cousub", "countyname", "fips2010", "pop2010", "state", "state_alpha"]]
            fmr_data[year].rename(columns={'fips2010':'fips', 'pop2010': 'pop'}, inplace=True)
            # fmr_data[year] = fmr_data[year].query('cousub == 99999').reset_index(drop=True)
            fmr_data[year]["fips"] = fmr_data[year]["fips"].map(lambda x: (x / 100000) * 100000 + 99999)
            fmr_data[year] = fmr_data[year].drop_duplicates(subset=["state", "countyname"])
        elif year > 2005:
            fmr_data[year] = fmr_data[year][["fmr1", "county", "cousub", "countyname", "fips", "pop2000", "state", "state_alpha"]]
            fmr_data[year].rename(columns={'pop2000': 'pop'}, inplace=True)
            # fmr_data[year] = fmr_data[year].query('cousub == 99999').reset_index(drop=True)
            fmr_data[year]["fips"] = fmr_data[year]["fips"].map(lambda x: (x / 100000) * 100000 + 99999)
            fmr_data[year] = fmr_data[year].drop_duplicates(subset=["state", "countyname"])
        elif year == 2005:
            fmr_data[year] = fmr_data[year][["fmr_1bed", "county", "countyname", "pop2000", "state", "state_alpha", "stco"]]
            fmr_data[year].rename(columns={'stco':'fips', 'fmr_1bed': 'fmr1', 'pop2000': 'pop'}, inplace=True)
            fmr_data[year]['fips'] = fmr_data[year]['fips'].map(pad_fips)
        elif year == 2004:
            fmr_data[year] = fmr_data[year][["new_fmr1", "county", "countyname", "pop100", "state", "state_alpha"]]
            fmr_data[year]['fips'] = fmr_data[year]['state'].map(str) + fmr_data[year]['county'].map(pad_county)
            fmr_data[year].rename(columns={'new_fmr1': 'fmr1', 'pop100': 'pop'}, inplace=True)
            fmr_data[year]['fips'] = fmr_data[year]['fips'].map(pad_fips)
        elif year == 2003:
            fmr_data[year] = fmr_data[year][["fmr1", "county", "countyname", "pop", "state", "state_alpha"]]
            fmr_data[year]['fips'] = fmr_data[year]['state'].map(str) + fmr_data[year]['county'].map(pad_county)
            fmr_data[year]['fips'] = fmr_data[year]['fips'].map(pad_fips)
        elif year == 2002:
            # NOTE: we have to calculate FIPS codes by hand in cell below
            fmr_data[year] = fmr_data[year][["fmr1br", "areaname", "st"]]
            fmr_data[year].rename(columns={'st':'state_alpha', 'fmr1br': 'fmr1', 'areaname': 'countyname'}, inplace=True)

        # Inflation
        fmr_data[year]['fmr1_inf'] = fmr_data[year]['fmr1'] * updated_inflation_multipliers[year]
        
        # Add region column
        # METHOD: the defaultdict will use region_base if the state is not in the initial state to region mapping
        fmr_data[year]['region'] = fmr_data[year]['state_alpha'].map(lambda x: state_to_region_mapping[x])

In [243]:
# year = 2004
# f = open(PROJECT_PATH + "/data/fmr/fmr%s.csv" % year, 'rb')
# df = pd.read_csv(f)
# df.columns = map(str.lower, df.columns)
# df = df[["new_fmr1", "county", "countyname", "pop100", "state", "state_alpha"]]

# df['fips'] = df['state'].map(str) + df['county'].map(pad_county)
# df.rename(columns={'new_fmr1': 'fmr1'}, inplace=True)
# df['fips'] = df['fips'].map(pad_fips)
# df[ df['state_alpha'] == 'NH' ]

In [ ]:
# year = 2005
# f = open(PROJECT_PATH + "/data/fmr/fmr%s.csv" % year, 'rb')
# df2 = pd.read_csv(f)
# df2.columns = map(str.lower, df2.columns)

# df2 = df2[["fmr_1bed", "county", "countyname", "pop2000", "state", "state_alpha", "stco"]]
# df2.rename(columns={'stco':'fips', 'fmr_1bed': 'fmr1'}, inplace=True)
# df2['fips'] = df2['fips'].map(pad_fips)

# df2[ df2['state_alpha'] == 'NH' ]
# fmr_data[2005][fmr_data[2005]['state_alpha'] == 'NH']

In [14]:
# set(fmr_data[2005]['state_alpha'].values) - set(fmr_data[2006]['state_alpha'].values)


Out[14]:
{'CT', 'MA', 'ME', 'NH', 'RI', 'VT'}

In [342]:
##### Handle 2002 data ######

# Custom comparator to compare column of strings to given string
def compare_lambda(y):
    def compare(x):
        return (x[0], x[1], editdist.distance(x[1], y), x[2])
    return compare

# Init list of fips we need to find and a bitmap of which 2002 counties we processed
fips = [ None ] * len(fmr_data[2002]['countyname'])
found_bitmap = [ False ] * len(fmr_data[2002]['countyname'])

# For each county in 2002 ...
# for idx, countyname in enumerate(fmr_data[2002]['countyname']):
for idx, county_and_state in enumerate(fmr_data[2002][['countyname', 'state_alpha']].values.tolist()):
    (countyname, state) = county_and_state
    
    # See if any row matches this countyname exactly
    county_matches = fmr_data[2003][ (fmr_data[2003]['state_alpha'] == state) & (fmr_data[2003]['countyname'].str.lower() == countyname.lower()) ]
    
    if len(county_matches) > 0:
        fips_val = county_matches['fips'].values[0]
        found_bitmap[idx] = True
        fips[idx] = fips_val

In [ ]:
##### Handle 2002 data ######

# For each county in 2002 ...
# for idx, countyname in enumerate(fmr_data[2002]['countyname']):
for idx, county_and_state in enumerate(fmr_data[2002][['countyname', 'state_alpha']].values.tolist()):
    (countyname, state) = county_and_state
        
    # If already matched, we skip; otherwise ...
    if not found_bitmap[idx]:
        # Get list of counties (as tuples) in 2003 which we try to match to
        # good_counties = list(enumerate(fmr_data[2003]['countyname']))
        good_counties = list(enumerate([ list(i) for i in fmr_data[2003][ fmr_data[2003]['state_alpha'] == state ][['countyname', 'fips']].values]))

        if len(good_counties) == 0:
            continue
        
        # Get list of distances from 2002 countyname to all 2003 countynames
        # NOTE: use of compare_lambda to create custom comparator that also 
        # returns data in (idx, countyname, levdist) form
        distances = map(compare_lambda(countyname.lower()), 
                        map(lambda x: (x[0], x[1][0].lower(), x[1][1]), 
                            good_counties))
        
        # Find the minimum distance (with custom key to only compare third element, which is levdist)
        min_distance = min(distances, key=lambda x: x[2])

        
        # Update bitmap and store appropriate FIPS code from 2003 
        found_bitmap[idx] = True
        fips[idx] = min_distance[3]
        
# Add calculated fips to new column in 2002
fmr_data[2002]['fips'] = fips

In [75]:
##### Construct final multi-level dataframe #####

fmr_df = pd.DataFrame()
for year in model_years:
    mindex = pd.MultiIndex.from_tuples(zip([year]*len(fmr_data[year]), fmr_data[year]['fips']), names=["year", "fips"])
    new_df = fmr_data[year]
    new_df.index = mindex
    new_df.columns = fmr_data[year].columns
    fmr_df = pd.concat([fmr_df, new_df])

# fmr_df.to_csv("fmr_df.v1.csv")

In [248]:
# [ list(i) for i in fmr_data[2003][ fmr_data[2003]['state_alpha'] == 'OR' ][['countyname', 'fips']].values]

# fmr_data[2002][ fmr_data[2002]['state_alpha'] == 'OR' ][['countyname', 'fips']].values

# np.sum(found_bitmap) / float(len(found_bitmap))

# print len(set(fmr_data[2002][ fmr_data[2002]['state_alpha'] == 'NY' ]['fips']))
# print len(set(fmr_data[2003][ fmr_data[2003]['state_alpha'] == 'NY' ]['fips']))

# set(fmr_data[2003][ fmr_data[2003]['state_alpha'] == 'NY' ]['fips']).difference(set(fmr_data[2002][ fmr_data[2002]['state_alpha'] == 'NY' ]['fips']))

Filter out FIPS that are consistent

The data from 2002-2003 seems unreliable with respect to matching counties, so the analysis is restricted to 2004 onward for now, until I can figure out a way to do a better mapping.


In [14]:
countyset = set(fmr_data[2004]['fips'])
# print len(countyset)

# Create set that is the intersection of all fips values
for year in range(2005, 2014):
    countyset = countyset.intersection(set(fmr_data[year]['fips']))

# Filter out each dataframe based on above set
for year in range(2005, 2014):
    fmr_data[year] = fmr_data[year][ fmr_data[year].isin({'fips': countyset})['fips'].values ]

# Confirm
# print fmr_data[2004].sort('fips')
# print fmr_data[2005].sort('fips')

Issue with county change from 2005 to 2006


In [15]:
for year in model_years[:-1]:  # compare each year to the next, so stop one year short
    x = set(fmr_data[year]['fips'])
    y = set(fmr_data[year+1]['fips'])
    print("Diff between %d and %d is: %s" % (year, year+1, len(y.difference(x))))
    print("Diff between %d and %d is: %s" % (year, year+1, len(x.difference(y))))
    print

# print(list(set(fmr_data[2005]['fips']))[0:10])
# print(list(set(fmr_data[2006]['fips']))[0:10])

print set(fmr_data[2006]['fips']).difference(set(fmr_data[2005]['fips']))


Diff between 2004 and 2005 is: 0
Diff between 2004 and 2005 is: 103

Diff between 2005 and 2006 is: 0
Diff between 2005 and 2006 is: 0

Diff between 2006 and 2007 is: 0
Diff between 2006 and 2007 is: 0

Diff between 2007 and 2008 is: 0
Diff between 2007 and 2008 is: 0

Diff between 2008 and 2009 is: 0
Diff between 2008 and 2009 is: 0

Diff between 2009 and 2010 is: 0
Diff between 2009 and 2010 is: 0

Diff between 2010 and 2011 is: 0
Diff between 2010 and 2011 is: 0

Diff between 2011 and 2012 is: 0
Diff between 2011 and 2012 is: 0

Diff between 2012 and 2013 is: 0
Diff between 2012 and 2013 is: 0

Diff between 2013 and 2014 is: 56
Diff between 2013 and 2014 is: 0


Most Populous Counties

This article from Business Insider lists the top 150 counties by population.


In [705]:
# Taken from link above
most_populous_counties = [("Los Angeles County","CA"),("Cook County", "IL"),("Harris County", "TX"),("Maricopa County", "AZ"),("San Diego County","CA"),("Orange County", "CA"),("Miami-Dade County","FL"),("Kings County", "NY"),("Dallas County", "TX"),("Queens County", "NY"),("Riverside County", "CA"),("San Bernardino County","CA"),("King County", "WA"),("Clark County", "NV"),("Tarrant County", "TX"),("Santa Clara County","CA"),("Broward County", "FL"),("Wayne County", "MI"),("Bexar County", "TX"),("New York County","NY"),("Alameda County", "CA"),("Philadelphia County", "PA"),("Middlesex County", "MA"),("Suffolk County", "NY"),("Sacramento County", "CA"),("Bronx County", "NY"),("Palm Beach County","FL"),("Nassau County", "NY"),("Hillsborough County", "FL"),("Cuyahoga County", "OH"),("Allegheny County", "PA"),("Oakland County", "MI"),("Orange County", "FL"),("Franklin County", "OH"),("Hennepin County", "MN"),("Fairfax County", "VA"),("Travis County", "TX"),("Contra Costa County","CA"),("Salt Lake County","UT"),("Montgomery County", "MD"),("St. Louis County", "MO"),("Pima County", "AZ"),("Fulton County", "GA"),("Honolulu County", "HI"),("Mecklenburg County", "NC"),("Westchester County", "NY"),("Milwaukee County", "WI"),("Wake County", "NC"),("Fresno County", "CA"),("Shelby County", "TN"),("Fairfield County", "CT"),("DuPage County", "IL"),("Pinellas County", "FL"),("Erie County", "NY"),("Marion County", "IN"),("Bergen County", "NJ"),("Hartford County", "CT"),("Prince George's County", "MD"),("Duval County", "FL"),("New Haven County","CT"),("Kern County", "CA"),("Macomb County", "MI"),("Gwinnett County", "GA"),("Ventura County", "CA"),("Collin County", "TX"),("El Paso County","TX"),("San Francisco County","CA"),("Middlesex County", "NJ"),("Baltimore County", "MD"),("Pierce County", "WA"),("Montgomery County", "PA"),("Hidalgo County", "TX"),("Worcester County", "MA"),("Hamilton County", "OH"),("Essex County", "NJ"),("Multnomah County", "OR"),("Essex County", "MA"),("Jefferson County", "KY"),("Monroe County", "NY"),("Suffolk County", "MA"),("Oklahoma County", "OK"),("San Mateo County","CA"),("Snohomish County", "WA"),("Cobb County", "GA"),("Denton County", "TX"),("DeKalb County", "GA"),("San Joaquin County","CA"),("Lake County", "IL"),("Will County", "IL"),("Norfolk County", "MA"),("Jackson County", "MO"),("Bernalillo County", "NM"),("Jefferson County", "AL"),("Hudson County", "NJ"),("Davidson County", "TN"),("Lee County", "FL"),("El Paso County","CO"),("Denver County", "CO"),("District of Columbia","DC"),("Monmouth County", "NJ"),("Providence County", "RI"),("Fort Bend County","TX"),("Bucks County", "PA"),("Baltimore city", "MD"),("Polk County", "FL"),("Kent County", "MI"),("Tulsa County", "OK"),("Arapahoe County", "CO"),("Ocean County", "NJ"),("Delaware County", "PA"),("Johnson County", "KS"),("Bristol County", "MA"),("Anne Arundel County","MD"),("Washington County", "OR"),("Brevard County", "FL"),("New Castle County","DE"),("Jefferson County", "CO"),("Union County", "NJ"),("Summit County", "OH"),("Utah County", "UT"),("Montgomery County", "OH"),("Douglas County", "NE"),("Lancaster County", "PA"),("Kane County", "IL"),("Stanislaus County", "CA"),("Ramsey County", "MN"),("Camden County", "NJ"),("Chester County", "PA"),("Sedgwick County", "KS"),("Dane County", "WI"),("Passaic County", "NJ"),("Guilford County", "NC"),("Plymouth County", "MA"),("Morris County", "NJ"),("Volusia County", "FL"),("Lake County", "IN"),("Sonoma County", "CA"),("Montgomery County", "TX"),("Spokane County", "WA"),("Richmond County", "NY"),("Pasco County", "FL"),("Greenville County", "SC"),("Onondaga County", "NY"),("Hampden County", "MA"),("Adams County", "CO"),("Williamson County", "TX")]

most_populous_fips = []

for (county, state) in most_populous_counties:
    selector = fmr_data[2014][ fmr_data[2014]['state_alpha'] == state ]['countyname'].map(str.lower)  == county.lower()
    if not np.any(selector):
        print "missed %s %s" % (county, state)
    else:
        most_populous_fips.append( fmr_data[2014][ fmr_data[2014]['state_alpha'] == state ][selector]['fips'].values[0] )

In [706]:
print len(most_populous_counties)
print len(most_populous_fips)


146
146

In [11]:
# Serialize
# pickle.dump(most_populous_fips, open(os.path.join(PROJECT_PATH, "data/most_populous_fips.v2.csv"), "wb"))

# De-serialize
most_populous_fips = pickle.load(open(os.path.join(PROJECT_PATH, "data/most_populous_fips.v2.csv"), "rb"))

Food Costs

(TOC)

Data for the food calculations has been downloaded in PDF form. The main calculation, quoted from the PDF, is:

Adult food consumption costs are estimated by averaging the low-cost plan food costs for males and females between 19 and 50

Note that we add 20% to the values from the data sheets, since the notes on all published USDA PDFs state that 20% should be added to the listed values for individuals:

The costs given are for individuals in 4-person families. For individuals in other size families, the following adjustments are suggested: 1-person—add 20 percent; ...

The notes for the model also state that regional weights are applied to give a better estimate of food costs across the nation. The results of this section are values for 2014 that exactly match the data given on the model website, so I am confident the implementation of the methodology below is correct.

The final data can be seen in the Appendix: Food Costs Data Table

Notes: Change of USDA Methodology

In 2006, the USDA changed the age ranges used for their healthy meal cost calculations. The differences in range are minimal and should not affect overall estimations.

Methodology Confidence

The methodology in this section produces numbers that exactly match the original model's, so confidence in the methodology is high.


In [625]:
# The base monthly food cost (not regionally weighted) for the nation (data pulled manually from the PDFs)
national_monthly_food_cost_per_year = {
    2014: {"base": np.average([241.50, 209.80])},
    2013: {"base": np.average([234.60, 203.70])},
    2012: {"base": np.average([234.00, 203.00])},
    2011: {"base": np.average([226.80, 196.90])},
    2010: {"base": np.average([216.30, 187.70])},
    2009: {"base": np.average([216.50, 187.90])},
    2008: {"base": np.average([216.90, 189.60])},
    2007: {"base": np.average([200.20, 174.10])},
    2006: {"base": np.average([189.70, 164.80])},
    2005: {"base": np.average([186.20, 162.10])},
    2004: {"base": np.average([183.10, 159.50])},
    2003: {"base": np.average([174.20, 151.70])},
    2002: {"base": np.average([170.30, 148.60])},
    2001: {"base": np.average([166.80, 145.60])},
}

# Create ordered dict to make sure we process things in order
national_monthly_food_cost_per_year = OrderedDict(sorted(national_monthly_food_cost_per_year.items(), 
                                                        key=lambda t: t[0]))

# Adjust the data according to notes above
for year in national_monthly_food_cost_per_year:
    # Inflation and 20% adjustment
    national_monthly_food_cost_per_year[year]["base"] = \
        national_monthly_food_cost_per_year[year]["base"] * 1.20 * updated_inflation_multipliers[year]

    # Regional adjustment
    for region in food_regional_multipliers:
        national_monthly_food_cost_per_year[year][region] = \
            national_monthly_food_cost_per_year[year]["base"] * (1 + food_regional_multipliers[region])

national_monthly_food_cost_per_year_df = pd.DataFrame.from_dict(national_monthly_food_cost_per_year)

Child Care Cost

(TOC)

PDFs were downloaded manually from ChildCareAware.org. Sadly, they only go back to 2010. I can now either:

  • find other estimates of child care costs for pre-2010 (preferred)
  • check if the Consumer Expenditure Survey has data on this
  • impute the data (I don't think this is a good idea)
  • limit the analysis to 2010 onward (which seems limiting, since other data, like the 2014 Consumer Expenditure Survey, provides 2013 data and that is the latest currently available).

Currently I am focusing only on modeling costs for a single adult (an assumption I made early on), since I am interested in trends and the other 'family configurations' are just linear combinations of the costs for one adult and for one child. However, if I wanted to extend the numbers to 1 adult + 1 child, I would have to look into this further. For now I'll move on.

Health Insurance Costs

(TOC)

The model uses data from the Medical Expenditure Panel Survey from the Agency for Healthcare Research and Quality (searchable here). Specifically, the model assumes a single adult's insurance costs are best estimated from Table X.C.1 Employee contribution distributions (in dollars) for private-sector employees enrolled in single coverage. This survey gives the mean cost for a single adult per state.

Table X.C.1 was only added to the survey starting in 2006. There is an alternative table that appears in all years (Table II.C.2: Average total employee contribution (in dollars) per enrolled employee for single coverage at private-sector establishments), which is what was downloaded in the previous section.

One problem is that this survey was not conducted in 2007. I solved this by linearly imputing data from 2006 and 2008, which seems reasonable if we can assume that costs tend to go up every year rather than down. This is true for the data I have looked at.

Another problem is that some states do not appear in the earlier data due to funding issues (and not being able to get a statistically significant sample). I fix this by filling in the missing states with the value reported for 'states not shown separately'.

Below is the code for processing each HTML file.

Final table shown in Appendix: Insurance Costs Data Table


In [627]:
# Process HTML files with BeautifulSoup
insurance_costs = {}
insurance_costs_path = os.path.join(PROJECT_PATH, "data/insurance")

# Loop thru all the files
for filename in [f for f in os.listdir(insurance_costs_path) if f.endswith('.html')]:
    states = {}
    
    # File is for what year?
    year = int(filename.split('_')[0])
    
    # Open file
    full_filename = os.path.join(insurance_costs_path, filename)
    f = open(full_filename, "r")
    
    # Import into BeautifulSoup
    data = f.readlines()
    soup = BeautifulSoup(''.join(data))

    # Works for years 2008 - 2014
    if year in range(2008, 2015):
        for tr in soup.find_all('tr'):
            # State is located in the TR element
            state = tr.get_text().split("\n")[1].lower().strip()
            
            # Find the data, but if you can't, skip it
            td = tr.find_all('td')
            value = None
            if td: 
                try:
                    value = float(td[0].get_text().strip().replace(",", ""))
                    
                    # Account for inflation and round up
                    value = float(np.round(value * updated_inflation_multipliers[year]))
                except ValueError as e:
                    continue

                # We need to stop processing after the first chunk or if we couldn't get a value
                if state not in states and value:
                    states[state] = value
    # Works for 2001 - 2006
    elif year in range(2001, 2007):
        for tr in soup.find_all('tr'):
            td = tr.find_all('td')

            value = None
            if len(td) > 2: 
                # Same as above, but the state is in the first TD, not in the TR
                state = td[0].get_text().lower().strip()
                try:
                    value = float(td[1].get_text().strip().replace(",", ""))
                    
                    # Account for inflation and round up
                    value = float(np.round(value * updated_inflation_multipliers[year]))
                except ValueError as e:
                    continue

            if state not in states and value:
                states[state] = value
    else:
        pass

    # Add data from file to global dict
    insurance_costs[year] = states

    
# For each state in 2007, linearly impute the data
insurance_costs[2007] = { }
for state in insurance_costs[2014]:
    insurance_costs[2007][state] = (insurance_costs[2006][state] + insurance_costs[2008][state]) / 2.0

def state_filter(state):
    ''' Filter out some entries from the html that we pick up as states'''
    return "district" not in state and 'united' not in state and 'separately' not in state

# Get all states in 2014, assuming thats the largest set of states
full_set_of_states = set([state for state in sorted(insurance_costs[2014].keys()) if state_filter(state)])
for year in range(2001, 2015):   
    # Find current set of states from this year
    current_set_of_states = set([state for state in sorted(insurance_costs[year].keys()) if state_filter(state)])
    
    # Find difference between states we have now and states in 2014
    diff = full_set_of_states.difference(current_set_of_states)
    
    # If there are some states missing, fill in those states with given value from "States not shown separately" in data
    if diff and 'states not shown separately' in insurance_costs[year]:
        # Fill in each state
        for state in list(diff):
            insurance_costs[year][state] = insurance_costs[year]['states not shown separately']

# Create final dataframe results for this  model variable
insurance_costs_df = pd.DataFrame(insurance_costs)

Transportation Costs

(TOC)

Looking at the (1) Cars and trucks (used), (2) gasoline and motor oil, (3) other vehicle expenses, and (4) public transportation fields under "Transportation" in the 2014 Consumer Expenditure Report, we can pull information from each to reproduce the calculation done in the original model. For each sub-variable, we get the aggregate amount of money (in millions) and the percentage of it spent by single adults. Multiplying those numbers (accounting for units) and dividing by the total number of single adults in the survey gives a mean total cost per adult.
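
As a small worked example of the unit handling (hypothetical numbers, not taken from the survey): if a sub-category shows $50,000 million in aggregate spending, single adults account for 12% of it, and the survey reports 35,000 thousand single-adult consumer units, then:

aggregate_millions = 50000.0    # hypothetical aggregate spending, in millions of dollars
single_adult_pct = 12.0         # hypothetical share spent by single adults, in percent
num_adults_thousands = 35000.0  # hypothetical count of single-adult units, in thousands

per_adult_cost = (aggregate_millions * 1e6 * single_adult_pct * 0.01) / (num_adults_thousands * 1000.0)
# 6,000,000,000 / 35,000,000 ≈ $171.43 per adult per year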

The original model takes regional variation into account by scaling based on each region. NOTE: see the TODO in this section.

Data can be seen in full in Appendix: Transportation Costs Data Tables

TODO:

  • Figure out how to do regional differences correctly. Emailed model creator for clarification
  • Since this data reflects conditions in 2013, we account for inflation to get the 2014 estimate that is produced in the original model. Is this correct?
  • See if there is any issue in using totals for Apparel and services, versus men over 16 wrt regional weighting

Health Care Costs

(TOC)

The health component of the basic needs budget includes: (1) health insurance costs for employer sponsored plans, (2) medical services, (3) drugs, and (4) medical supplies. [8] Costs for medical services, drugs and medical supplies were derived from 2013 national expenditure estimates by household size provided in the 2014 Bureau of Labor Statistics Consumer Expenditure Survey

Data can be seen in full in Appendix: Health Care Costs

Other Necessities Cost

(TOC)

Expenditures for other necessities are based on 2013 data by household size from the 2014 Bureau of Labor Statistics Consumer Expenditure Survey including: (1) Apparel and services, (2) Housekeeping supplies, (3) Personal care products and services, (4) Reading, and (5) Miscellaneous. These costs were further adjusted for regional differences using annual expenditure shares reported by region

The data can be seen in full Appendix: Other Necessities Cost


In [628]:
# Created this CSV by hand by reading in appropriate data from each CEX survey
cex_full_data = pd.DataFrame.from_csv("data/cex_survey/cex_data.csv")

cex_other_columns          = ['housekeeping', 'apparel and services', 'personal care', 'reading', 'misc']
cex_health_columns         = ['healthcare']
cex_transportation_columns = ['cars and trucks used', 'gasoline and motor oil', 'others', 'public']

transportation_costs = { }
health_costs = { }
other_costs = { }

# For every year in the CEX data ...
for year in range(2001, 2015):
    # Convert int year to str for indexing; store number of adults in single adults field
    str_year = str(year)
    num_adults = cex_full_data.loc["number of adults"].loc[str_year]

    # Calc transportation model variable
    transportation_cost = 0.0
    for cex_column in cex_transportation_columns:
        transportation_cost += (cex_full_data.loc[cex_column].loc[str_year] * 1000000.0 * \
                        cex_full_data.loc[cex_column + ' percent'].loc[str_year] * 0.01 ) / float(num_adults * 1000)
    # print transportation_costs  * inflation_multipliers[year - 1]
    transportation_costs[year] = {'transportation_costs': transportation_cost * updated_inflation_multipliers[year]}
    
    
    # Calc health model variable
    health_cost = 0.0
    for cex_column in cex_health_columns:
        health_cost += (cex_full_data.loc[cex_column].loc[str_year] * 1000000.0 * \
                        cex_full_data.loc[cex_column + ' percent'].loc[str_year] * 0.01 ) / float(num_adults * 1000)
    # print health_costs  * inflation_multipliers[year-1]
    health_costs[year] = {'health_costs': health_cost * updated_inflation_multipliers[year]}
    
    
    # Calc other model variable
    other_cost = 0.0
    for cex_column in cex_other_columns:
        other_cost += (cex_full_data.loc[cex_column].loc[str_year] * 1000000.0 * \
                        cex_full_data.loc[cex_column + ' percent'].loc[str_year] * 0.01 ) / float(num_adults * 1000) 
    # print other_costs  * inflation_multipliers[year-1]
    other_costs[year] = {'other_costs': other_cost * updated_inflation_multipliers[year]}

# Create Data frames
transportation_costs_df = pd.DataFrame(transportation_costs)
health_costs_df = pd.DataFrame.from_dict(health_costs)
other_costs_df = pd.DataFrame.from_dict(other_costs)

Taxes Data

(TOC)

From the model documentation:

Estimates for payroll taxes, state income tax, and federal income tax rates are included in the calculation of a living wage. Property taxes and sales taxes are already represented in the budget estimates through the cost of rent and other necessities.

All tax data can be found in the Appendix: Tax Data Tables.

Lets look at the other tax break downs:

Payroll Taxes

A flat payroll tax and state income tax rate is applied to the basic needs budget. Payroll tax is a nationally representative rate as specified in the Federal Insurance Contributions Act.

The payroll tax rate (Social Security and Medicare taxes) is 6.2% of total wages as of 2014.

I am not sure where the model gets 6.2% from. The data from the SSA website states that 6.2% is the rate for the Social Security part of the FICA tax. This might be a mistake in the original model. I will use 6.2% for any work in confirming how close I am to the real model, but will use the combined rate (which includes Medicare's Hospital Insurance rate) when calculating final numbers for my model.

Another thing to note is that in 2011 and 2012, the rate for the Social Security part of the FICA tax was 2 percentage points lower for individuals.


In [629]:
# Data from FICA rates
updated_fica_tax_rate = dict(zip(
        [2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014],
        [0.0765, 0.0765, 0.0765, 0.0765, 0.0765, 0.0765, 0.0765, 0.0765, 0.0765, 0.0765, 0.0765, 0.0565, 0.0565, 0.0765, 0.0765]))

# Create dataframe
updated_fica_tax_rate_df = pd.DataFrame.from_dict({"fica rate": updated_fica_tax_rate}).transpose()

# Data that the model used (see notes above)
fica_tax_rate = {
    2013: 0.062
}

State Tax Rate

The state tax rate is taken from the second lowest income tax rate for 2011 for the state as reported by the CCH State Tax Handbook (the lowest bracket was used if the second lowest bracket was for incomes of over 30,000) (we assume no deductions). [26]

State income tax rates are for the 2011 tax year. These rates were taken from the 2011 CCH Tax Handbook (various organizations provide the CCH State Tax Handbook rates (including The Tax Foundation)). No updates were available as of March 30, 2014

Using the Excel file provided by The Tax Foundation, the second lowest tax bracket's rate is chosen as the rate for the state (except when that bracket is for incomes over 30k, in which case the lowest is used, as the original model suggests).

This only came into play in the later years for Vermont, North Dakota, and Rhode Island. To be consistent, I used the lowest tax bracket for all years for these states.

Note that I used the rate under "Single", since the model is only for adults. This was done by hand by importing the correct numbers from the spreadsheet, which is loaded via CSV below:
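
The bracket-selection rule can be sketched as a small helper (hypothetical; the actual selection for the CSV below was done by hand):

def pick_state_rate(brackets, income_cap=30000):
    ''' brackets: list of (income threshold, rate) pairs sorted by threshold.
        Use the second-lowest bracket's rate, unless that bracket only applies
        to incomes above income_cap, in which case fall back to the lowest. '''
    if len(brackets) < 2:
        return brackets[0][1]
    threshold, rate = brackets[1]
    return brackets[0][1] if threshold > income_cap else rate

# Hypothetical example: brackets at $0 (2%) and $10,000 (4%) -> 0.04
# pick_state_rate([(0, 0.02), (10000, 0.04)])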


In [630]:
updated_state_tax_rate_df = pd.DataFrame.from_csv("data/taxes/formatted_state_taxes.csv")
updated_state_tax_rate_df = updated_state_tax_rate_df.apply(lambda x: x / 100)
updated_state_tax_rate_df.columns = map(int, updated_state_tax_rate_df.columns)

Federal Income Tax Rate

The federal income tax rate is calculated as a percentage of total income based on the average tax paid by median-income four-person families as reported by the Tax Policy Center of the Urban Institute and Brookings Institution for 2013. 27

The Tax Policy Center reported that the average federal income tax rate for 2013 was 5.32%. This estimate includes the effects of (1) the Earned Income Tax Credit (assuming two eligible children), (2) the Child Tax Credit expansion as part of EGTRRA, and (3) the Making Work Pay Credit enacted in the American Recovery and Reinvestment Act of 2009.

One issue is that the model authors used "Historical Federal Income Tax Rates for a Family of Four". Since I am focusing on single adults, I should use "Historical Average Federal Tax Rates for Nonelderly Childless Households". However, that data stops at 2011 for some reason, so for consistency, I will stick with the model definition and use the Family of Four rate.

Also, the model officially used a number that is different from what is on the updated link above. I will use the number used by the model to confirm the methodology (if I can), but will use numbers from the updated data going forward.


In [631]:
# Original model used 5.32% for the tax rate in 2013; the document this was taken from has since been updated. 
# So to confirm my methodology, I will use the 5.32% value; however, I will use updated information going forward
updated_federal_income_tax_rate = dict(zip(
        [2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014],
        [0.0802, 0.0671, 0.0656, 0.0534, 0.0537, 0.0569, 0.0585, 0.0593, 0.0354, 0.0447, 0.0450, 0.0559, 0.0584, 0.0579, 0.0534]))

# Create dataframe
updated_federal_income_tax_rate_df = \
    pd.DataFrame.from_dict({'federal income tax rate': updated_federal_income_tax_rate}).transpose()

# Value used by model, used for verifying methodology
federal_income_tax_rate = {
    2013: 0.0532
}

Minimum Wage or Median Wage per County or State


In [ ]:




Creating Final Merged Data Frame

(TOC)

Take all the data loaded above and combine it into a multi-level index data frame


In [632]:
final_model_values = []

# For every year + fips combo in housing data ...
for year, df in fmr_data.iteritems():
    # Skip all non-model years for now
    if year not in model_years:
        continue
    
    for x in df.iterrows():
        # First entry in tuple is index value
        (_, fips) = x[0]
        
        ## Look up values for model variables
        
        # Extract this row's state and lookup region
        state = x[1]['state_alpha']
        region = state_to_region_mapping[state]

        # Some tax rate info is not found for these, so skip for now
        if state in ['PR', 'VI', 'GU']: continue
        
        # Housing cost is in this df (convert from monthly cost)
        housing_cost = x[1]['fmr1'] * 12
        
        # Regional food cost (convert from monthly cost)
        food_cost = national_monthly_food_cost_per_year_df.loc[region][year] * 12
        
        # Insurance cost
        # NOTE: Convert state name to alpha
        try:
            insurance_cost = insurance_costs_df.loc[state_to_statename_mapping[state]][year]
        except Exception as e:
            # If the state is not found, use US average
            insurance_cost = insurance_costs_df.loc['united states'][year]
            
        # CEX data
        healthcare_cost = health_costs_df.iloc[0][year]
        transportation_cost = transportation_costs_df.iloc[0][year]
        other_cost = other_costs_df.iloc[0][year]
        
        # Get the state tax rate for this year and state
        try:
            state_tax_rate = updated_state_tax_rate_df.loc[state][year] / 100.0
        except Exception as e:
            print "State tax issue with %s and %d" % (state, year)
            break
        
        # Keep population, even though it's not a model variable (this will be used for proper weighted averages)
        pop = x[1]["pop"]
        
        # Create tuple of model values for this county in this year
        model_values = (year, fips, state, food_cost, insurance_cost, healthcare_cost, 
                        housing_cost, transportation_cost, other_cost, state_tax_rate, pop)
        final_model_values.append( model_values )

# Create dataframe from list of tuples
columns = ["year", "fips", "state", "food_cost", "insurance_cost", "healthcare_cost", 
            "housing_cost", "transportation_cost", "other_cost", "state_tax_rate", "pop"]
final_model_values_df = pd.DataFrame(final_model_values, columns=columns)
final_model_values_df = final_model_values_df.set_index(["year", "fips"])

In [633]:
# Add appropriate columns for federal tax rates
fica_tax_rates = [ ]
federal_income_tax_rates = [ ]

for x in final_model_values_df.iterrows():
    (year, _) = x[0]
    
    fica_tax_rates.append( updated_fica_tax_rate_df.iloc[0][year] )
    federal_income_tax_rates.append( updated_federal_income_tax_rate_df.iloc[0][year] )

# Create new columns in model dataframe
final_model_values_df['fica_tax_rate'] = fica_tax_rates
final_model_values_df['federal_income_tax_rate'] = federal_income_tax_rates

In [634]:
# Adding column that sums the costs and deals with state and federal taxes
final_model_values_df['total_cost'] = \
    (final_model_values_df['food_cost'] + \
    final_model_values_df['insurance_cost'] + \
    final_model_values_df['healthcare_cost'] + \
    final_model_values_df['housing_cost'] + \
    final_model_values_df['transportation_cost'] + \
    final_model_values_df['other_cost']) * (1 + final_model_values_df['fica_tax_rate'] + \
                                                final_model_values_df['federal_income_tax_rate'] + \
                                                final_model_values_df['state_tax_rate'] )

In [635]:
# Make it easy to look up by FIPS code
final_model_values_df = final_model_values_df.reset_index('year')
final_model_values_df


Out[635]:
year state food_cost insurance_cost healthcare_cost housing_cost transportation_cost other_cost state_tax_rate pop fica_tax_rate federal_income_tax_rate total_cost
fips
101599999 2004 AL 2888.365130 914 2136.603659 3960 3924.435905 2800.961068 0.000400 112249 0.0765 0.0537 18795.507931
108199999 2004 AL 2888.365130 914 2136.603659 4656 3924.435905 2800.961068 0.000400 115092 0.0765 0.0537 19582.405531
100999999 2004 AL 2888.365130 914 2136.603659 5772 3924.435905 2800.961068 0.000400 51024 0.0765 0.0537 20844.155131
107399999 2004 AL 2888.365130 914 2136.603659 5772 3924.435905 2800.961068 0.000400 662047 0.0765 0.0537 20844.155131
111599999 2004 AL 2888.365130 914 2136.603659 5772 3924.435905 2800.961068 0.000400 64742 0.0765 0.0537 20844.155131
111799999 2004 AL 2888.365130 914 2136.603659 5772 3924.435905 2800.961068 0.000400 143293 0.0765 0.0537 20844.155131
111399999 2004 AL 2888.365130 914 2136.603659 5028 3924.435905 2800.961068 0.000400 49756 0.0765 0.0537 20002.988731
107999999 2004 AL 2888.365130 914 2136.603659 4488 3924.435905 2800.961068 0.000400 34803 0.0765 0.0537 19392.464731
110399999 2004 AL 2888.365130 914 2136.603659 4488 3924.435905 2800.961068 0.000400 111064 0.0765 0.0537 19392.464731
104599999 2004 AL 2888.365130 914 2136.603659 4116 3924.435905 2800.961068 0.000400 49129 0.0765 0.0537 18971.881531
106999999 2004 AL 2888.365130 914 2136.603659 4116 3924.435905 2800.961068 0.000400 88787 0.0765 0.0537 18971.881531
103399999 2004 AL 2888.365130 914 2136.603659 4332 3924.435905 2800.961068 0.000400 54984 0.0765 0.0537 19216.091131
107799999 2004 AL 2888.365130 914 2136.603659 4332 3924.435905 2800.961068 0.000400 87966 0.0765 0.0537 19216.091131
105599999 2004 AL 2888.365130 914 2136.603659 4092 3924.435905 2800.961068 0.000400 103459 0.0765 0.0537 18944.747131
108399999 2004 AL 2888.365130 914 2136.603659 5460 3924.435905 2800.961068 0.000400 65676 0.0765 0.0537 20491.407931
108999999 2004 AL 2888.365130 914 2136.603659 5460 3924.435905 2800.961068 0.000400 276700 0.0765 0.0537 20491.407931
100399999 2004 AL 2888.365130 914 2136.603659 5460 3924.435905 2800.961068 0.000400 140415 0.0765 0.0537 20491.407931
109799999 2004 AL 2888.365130 914 2136.603659 5460 3924.435905 2800.961068 0.000400 399843 0.0765 0.0537 20491.407931
100199999 2004 AL 2888.365130 914 2136.603659 5448 3924.435905 2800.961068 0.000400 43671 0.0765 0.0537 20477.840731
105199999 2004 AL 2888.365130 914 2136.603659 5448 3924.435905 2800.961068 0.000400 65874 0.0765 0.0537 20477.840731
110199999 2004 AL 2888.365130 914 2136.603659 5448 3924.435905 2800.961068 0.000400 223510 0.0765 0.0537 20477.840731
112599999 2004 AL 2888.365130 914 2136.603659 4704 3924.435905 2800.961068 0.000400 164875 0.0765 0.0537 19636.674331
100599999 2004 AL 2888.365130 914 2136.603659 3732 3924.435905 2800.961068 0.000400 29038 0.0765 0.0537 18537.731131
100799999 2004 AL 2888.365130 914 2136.603659 3732 3924.435905 2800.961068 0.000400 20826 0.0765 0.0537 18537.731131
101199999 2004 AL 2888.365130 914 2136.603659 3732 3924.435905 2800.961068 0.000400 11714 0.0765 0.0537 18537.731131
101399999 2004 AL 2888.365130 914 2136.603659 3732 3924.435905 2800.961068 0.000400 21399 0.0765 0.0537 18537.731131
101799999 2004 AL 2888.365130 914 2136.603659 3732 3924.435905 2800.961068 0.000400 36583 0.0765 0.0537 18537.731131
101999999 2004 AL 2888.365130 914 2136.603659 3732 3924.435905 2800.961068 0.000400 23988 0.0765 0.0537 18537.731131
102199999 2004 AL 2888.365130 914 2136.603659 3732 3924.435905 2800.961068 0.000400 39593 0.0765 0.0537 18537.731131
102399999 2004 AL 2888.365130 914 2136.603659 3732 3924.435905 2800.961068 0.000400 15922 0.0765 0.0537 18537.731131
... ... ... ... ... ... ... ... ... ... ... ... ... ...
5512999999 2014 WI 3101.261482 1263 2528.386391 6408 3965.869168 2195.290565 0.000584 15911 0.0765 0.0534 22001.262110
5513199999 2014 WI 3101.261482 1263 2528.386391 7752 3965.869168 2195.290565 0.000584 131887 0.0765 0.0534 23520.632606
5513399999 2014 WI 3101.261482 1263 2528.386391 7752 3965.869168 2195.290565 0.000584 389891 0.0765 0.0534 23520.632606
5513599999 2014 WI 3101.261482 1263 2528.386391 5844 3965.869168 2195.290565 0.000584 52410 0.0765 0.0534 21363.669134
5513799999 2014 WI 3101.261482 1263 2528.386391 5724 3965.869168 2195.290565 0.000584 24496 0.0765 0.0534 21228.011054
5513999999 2014 WI 3101.261482 1263 2528.386391 6036 3965.869168 2195.290565 0.000584 166994 0.0765 0.0534 21580.722062
5514199999 2014 WI 3101.261482 1263 2528.386391 5904 3965.869168 2195.290565 0.000584 74749 0.0765 0.0534 21431.498174
5600199999 2014 WY 3623.579206 1144 2528.386391 6936 3965.869168 2195.290565 0.000000 36299 0.0765 0.0534 23042.192310
5600399999 2014 WY 3623.579206 1144 2528.386391 5652 3965.869168 2195.290565 0.000000 11668 0.0765 0.0534 21591.400710
5600599999 2014 WY 3623.579206 1144 2528.386391 8520 3965.869168 2195.290565 0.000000 46133 0.0765 0.0534 24831.953910
5600799999 2014 WY 3623.579206 1144 2528.386391 6048 3965.869168 2195.290565 0.000000 15885 0.0765 0.0534 22038.841110
5600999999 2014 WY 3623.579206 1144 2528.386391 5664 3965.869168 2195.290565 0.000000 13833 0.0765 0.0534 21604.959510
5601199999 2014 WY 3623.579206 1144 2528.386391 6444 3965.869168 2195.290565 0.000000 7083 0.0765 0.0534 22486.281510
5601399999 2014 WY 3623.579206 1144 2528.386391 6408 3965.869168 2195.290565 0.000000 40123 0.0765 0.0534 22445.605110
5601599999 2014 WY 3623.579206 1144 2528.386391 5676 3965.869168 2195.290565 0.000000 13249 0.0765 0.0534 21618.518310
5601799999 2014 WY 3623.579206 1144 2528.386391 6444 3965.869168 2195.290565 0.000000 4812 0.0765 0.0534 22486.281510
5601999999 2014 WY 3623.579206 1144 2528.386391 6252 3965.869168 2195.290565 0.000000 8569 0.0765 0.0534 22269.340710
5602199999 2014 WY 3623.579206 1144 2528.386391 6996 3965.869168 2195.290565 0.000000 91738 0.0765 0.0534 23109.986310
5602399999 2014 WY 3623.579206 1144 2528.386391 7548 3965.869168 2195.290565 0.000000 18106 0.0765 0.0534 23733.691110
5602599999 2014 WY 3623.579206 1144 2528.386391 6912 3965.869168 2195.290565 0.000000 75450 0.0765 0.0534 23015.074710
5602799999 2014 WY 3623.579206 1144 2528.386391 6456 3965.869168 2195.290565 0.000000 2484 0.0765 0.0534 22499.840310
5602999999 2014 WY 3623.579206 1144 2528.386391 6288 3965.869168 2195.290565 0.000000 28205 0.0765 0.0534 22310.017110
5603199999 2014 WY 3623.579206 1144 2528.386391 6444 3965.869168 2195.290565 0.000000 8667 0.0765 0.0534 22486.281510
5603399999 2014 WY 3623.579206 1144 2528.386391 7752 3965.869168 2195.290565 0.000000 29116 0.0765 0.0534 23964.190710
5603599999 2014 WY 3623.579206 1144 2528.386391 9732 3965.869168 2195.290565 0.000000 10247 0.0765 0.0534 26201.392710
5603799999 2014 WY 3623.579206 1144 2528.386391 8064 3965.869168 2195.290565 0.000000 43806 0.0765 0.0534 24316.719510
5603999999 2014 WY 3623.579206 1144 2528.386391 9864 3965.869168 2195.290565 0.000000 21294 0.0765 0.0534 26350.539510
5604199999 2014 WY 3623.579206 1144 2528.386391 5868 3965.869168 2195.290565 0.000000 21118 0.0765 0.0534 21835.459110
5604399999 2014 WY 3623.579206 1144 2528.386391 6444 3965.869168 2195.290565 0.000000 8533 0.0765 0.0534 22486.281510
5604599999 2014 WY 3623.579206 1144 2528.386391 5940 3965.869168 2195.290565 0.000000 7208 0.0765 0.0534 21916.811910

34674 rows × 13 columns


In [8]:
# Serialize for future use
# final_model_values_df.to_csv("final_model_values_v4.csv")

# De-serialize
final_model_values_df = pd.DataFrame.from_csv("final_model_values_v4.csv")



Introductory Analysis

(TOC)

TODO:

Create visualizations on:

  • Find national mean living wage gap, plot it over time
  • Look at distributions over states of living wage gap over time (facet grid, each graph is a state showing gap over time)
  • Separate counties based on race and find national means of gap per year
  • motion chart of states, x = gap, y = life exp, debt levels
  • look at delta between 2013 and 2014 across FIPS

Living Wage in Individual Counties

(TOC)

Plotted here are the living wage values for Kings County, NY; Orange County, CA; and Mercer County, NJ:


In [24]:
# Plot three counties
f = plt.figure(figsize=(16, 10))
plt.plot(final_model_values_df.loc[kings_county]['year'], 
         final_model_values_df.loc[kings_county]['total_cost'], linestyle='--', marker='o', label="Kings County")
plt.plot(final_model_values_df.loc[orange_county]['year'], 
         final_model_values_df.loc[orange_county]['total_cost'], linestyle='--', marker='o', label="Orange County")
plt.plot(final_model_values_df.loc[mercer_county]['year'], 
         final_model_values_df.loc[mercer_county]['total_cost'], linestyle='--', marker='o', label="Mercer County")

f.axes[0].set_title('Kings, Orange and Mercer County')
f.axes[0].set_ylabel('Living Wage (Dollars / Year)')
f.axes[0].set_xlabel('Year')
l = f.axes[0].legend(loc='lower right')
plt.savefig("LaTeX/CurrentThesis/figures/county_trend.png")
plt.show()



In [12]:
final_model_values_df.loc[kings_county]


Out[12]:
year state food_cost insurance_cost healthcare_cost housing_cost transportation_cost other_cost state_tax_rate pop fica_tax_rate federal_income_tax_rate total_cost
fips
3604799999 2004 NY 3354.230474 899 2136.603659 11328 3924.435905 2800.961068 0.00045 2465326 0.0765 0.0537 27636.739250
3604799999 2005 NY 3298.289159 951 2128.872392 11592 4022.279086 2763.890813 0.00045 2465326 0.0765 0.0569 28069.966415
3604799999 2006 NY 3252.094925 1138 2158.144340 12036 4249.598878 2790.161546 0.00045 2465326 0.0765 0.0585 29094.770447
3604799999 2007 NY 3338.643532 1092 2046.496085 12828 4278.041226 2595.236769 0.00045 2465326 0.0765 0.0593 29745.227011
3604799999 2008 NY 3491.788629 1046 2003.912390 14220 4215.876585 2522.705462 0.00045 2465326 0.0765 0.0354 30589.939868
3604799999 2009 NY 3486.152857 1192 2228.358321 14160 3715.370400 2561.350255 0.00045 2465326 0.0765 0.0447 30669.535984
3604799999 2010 NY 3426.500615 1185 2209.464019 14664 3801.845418 2536.679344 0.00045 2465326 0.0765 0.0450 31216.563927
3604799999 2011 NY 3483.623290 1216 2229.735367 15132 3957.491897 2509.044764 0.00045 2465326 0.0565 0.0559 31747.268304
3604799999 2012 NY 3520.127668 1299 2419.270384 15360 4055.284079 2325.199510 0.00045 2465326 0.0565 0.0584 32321.595638
3604799999 2013 NY 3479.631205 1318 2417.518513 14916 4257.053626 2159.887757 0.00045 2504700 0.0765 0.0579 32397.801187
3604799999 2014 NY 3525.644632 1229 2528.386391 14580 3965.869168 2195.290565 0.00045 2504700 0.0765 0.0534 31677.144021

In [79]:
final_model_values_df.loc[mercer_county]


Out[79]:
year state food_cost insurance_cost healthcare_cost housing_cost transportation_cost other_cost state_tax_rate pop fica_tax_rate federal_income_tax_rate total_cost
fips
3402199999 2004 NJ 3354.230474 772.0 2136.603659 9576 3924.435905 2800.961068 0.000175 350761 0.0765 0.0537 25506.042736
3402199999 2005 NJ 3298.289159 1031.0 2128.872392 9756 4022.279086 2763.890813 0.000175 350761 0.0765 0.0569 26072.600724
3402199999 2006 NJ 3252.094925 1064.0 2158.144340 10200 4249.598878 2790.161546 0.000175 350761 0.0765 0.0585 26919.539598
3402199999 2007 NJ 3338.643532 1102.5 2046.496085 10824 4278.041226 2595.236769 0.000175 350761 0.0765 0.0593 27473.461784
3402199999 2008 NJ 3491.788629 1141.0 2003.912390 11184 4215.876585 2522.705462 0.000175 350761 0.0765 0.0354 27311.764715
3402199999 2009 NJ 3486.152857 1158.0 2228.358321 11496 3715.370400 2561.350255 0.000175 350761 0.0765 0.0447 27636.546845
3402199999 2010 NJ 3426.500615 1198.0 2209.464019 12060 3801.845418 2536.679344 0.000175 350761 0.0765 0.0450 28302.652542
3402199999 2011 NJ 3483.623290 1278.0 2229.735367 12216 3957.491897 2509.044764 0.000175 350761 0.0565 0.0559 28564.134083
3402199999 2012 NJ 3520.127668 1267.0 2419.270384 12648 4055.284079 2325.199510 0.000175 350761 0.0565 0.0584 29253.860645
3402199999 2013 NJ 3479.631205 1280.0 2417.518513 12012 4257.053626 2159.887757 0.000175 366513 0.0765 0.0579 29052.030811
3402199999 2014 NJ 3525.644632 1299.0 2528.386391 12204 3965.869168 2195.290565 0.000175 366513 0.0765 0.0534 29063.484419

State Averages of the Living Wage

(TOC)

TODO:

  • Add more analysis and comparisons of state averages
    • Show heat map of 2014? animate 2004 - 2014?

First, let's calculate the population-weighted state averages:


In [9]:
state_weighted_averages = { }
for state in state_to_statename_mapping.keys():
    subdf = final_model_values_df[ final_model_values_df['state'] == state ][["year", "pop", "total_cost"]].reset_index("fips")
    subdf["weighted_total_cost"] = subdf["total_cost"] * subdf["pop"]
    subdf_grouped = subdf.groupby("year")
    state_weighted_averages[state] = subdf_grouped["weighted_total_cost"].sum() / subdf_grouped["pop"].sum()
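
Each value in state_weighted_averages is a pandas Series indexed by year, holding that state's population-weighted average living wage for 2004 - 2014. A quick illustrative check (not one of the original cells; it only assumes the dictionary built above):

# Population-weighted average living wage for New York in 2014...
print state_weighted_averages['NY'].loc[2014]
# ...and the full 2004 - 2014 series for New Jersey
print state_weighted_averages['NJ']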

Here, I will compare the average living wage across states, grouping them by their 2014 average (above $28,000, below $24,000, or in between) and plotting each state's trend:


In [10]:
# pylab.close()

In [38]:
states_below_24000 = []
states_above_28000 = []
for state in state_weighted_averages:
    try:
        if state_weighted_averages[state].loc[2014] > 28000:
            states_above_28000.append(state)
        
        if state_weighted_averages[state].loc[2014] < 24100:
            states_below_24000.append(state)
    except:
        # Skip federal
        continue
print states_above_28000
print states_below_24000


[u'DC', u'HI', u'NJ', u'NY', u'CA', u'MA', u'CT', u'MD']
[u'WI', u'WV', u'WY', u'NM', u'TX', u'LA', u'NC', u'ND', u'NE', u'TN', u'AL', u'AR', u'GA', u'IN', u'IA', u'ID', u'ME', u'OK', u'OH', u'UT', u'MO', u'MN', u'MI', u'KS', u'MT', u'MS', u'SC', u'KY', u'OR', u'SD']

In [55]:
## Plot data for New York and NJ
f = plt.figure(figsize=(16, 15))

# x_vals = [2014.20, 2014.40, 2014.60, 2014.80, 2015, 2015.20]
x_vals = [2014.25, 2014.65, 2014.95, 2015.25]
# x_vals = [2014.50, 2015, 2015.50]
label_coords = []

for idx, state in enumerate(state_weighted_averages):
    try:
        # Color by group
        color = 'blue'
        if state in states_above_28000:
            color = 'red'
        elif state in states_below_24000:
            color = 'orange'
        
        # Style by group
        linestyle = "--"
        alpha=0.7
        if state in ["MD", "DC", "RI"]:
            linestyle='solid'
            alpha=None
        
        # Plot the state average trend line
        ax = state_weighted_averages[state].plot(label=state, alpha=alpha,linestyle=linestyle, marker='o', color=color)
        
        # Store coords of corresponding label
        y = state_weighted_averages[state].loc[2014]
        label_coords.append((state,y))
    except:
        # Skip federal
        continue

# Sort
sorted_label_coords = sorted(label_coords, key=lambda x: x[1])

# Print labels
for idx, coords in enumerate(sorted_label_coords):
    state, y = coords
    x = x_vals[idx % len(x_vals)]
    t = ax.text(x, y, state, ha='center', va='bottom', fontsize=9)
    
# Figure metadata
f.axes[0].set_title('Population Weighted State Averages of the Living Wage per Year')
f.axes[0].set_ylabel('Living Wage (Dollars / Year)')
f.axes[0].set_xlabel('Year')
f.axes[0].set_xlim((2004, 2016))
f.axes[0].set_ylim(19000, 33000)

# Legends
import matplotlib.patches as mpatches
red_patch = mpatches.Patch(color='red', label='> $28,000')
blue_patch = mpatches.Patch(color='blue', label='> \$24,000, < $28,000')
orange_patch = mpatches.Patch(color='orange', label='< $24,000')
plt.legend(handles=[red_patch, blue_patch, orange_patch], fontsize=15, loc="upper left")

f.axes[0].text(2004+0.12, 31500, "State Living Wage Avg Groupings", fontsize=15)

# Plot and save
plt.savefig("LaTeX/CurrentThesis/figures/state_trend.png")
plt.show()


Let's look at the standard deviation of these groupings:


In [54]:
states_above_28000_var = []
for state in states_above_28000:
    states_above_28000_var.append(np.std(state_weighted_averages[state]))
print np.mean(states_above_28000_var)

states_in_middle = set(state_weighted_averages) - set(states_below_24000) - set(states_above_28000)
states_in_middle_var = []
for state in states_in_middle:
    if state != 'Federal':
        states_in_middle_var.append(np.std(state_weighted_averages[state]))

print np.mean(states_in_middle_var)

states_below_24000_var = []
for state in states_below_24000:
    if state in state_weighted_averages:
        states_below_24000_var.append(np.std(state_weighted_averages[state]))
print np.mean(states_below_24000_var)


1395.76856586
881.890448046
659.76634149

Choropleth of Counties: 2004, 2007, 2010, 2014

(TOC)

Inspired by mbostock's Choropleth; adding JavaScript requirements:


In [56]:
%%javascript
require.config({
  paths: {
      d3: 'https://cdnjs.cloudflare.com/ajax/libs/d3/3.4.8/d3.min',
      queue: 'https://cdnjs.cloudflare.com/ajax/libs/queue-async/1.0.7/queue.min',
      topojson: 'https://cdnjs.cloudflare.com/ajax/libs/topojson/1.6.19/topojson.min'
  }
});


Python function that emits the JavaScript to render the map for a given data file and year:


In [57]:
from IPython.display import Javascript
def print_map(tsv_filename, map_id, svg_div_id, show_legend=False, domain_min=17000, domain_max=36500):
    if show_legend:
        show_legend = 'true'
    else:
        show_legend = 'false'
        
    return Javascript("""
    require(["d3", "queue", "topojson"], function(d3, queue, topojson) {
    
        var width = 960, 
            height = 600; 

        var rateById = d3.map(); 

        var quantize = d3.scale.quantize() 
            .domain([%d, %d]) 
            .range(d3.range(9).map(function(i) { return "q" + i + "-9"; })); 

        var projection = d3.geo.albersUsa() 
            .scale(1280) 
            .translate([width / 2, height / 2]); 

        var path = d3.geo.path() 
            .projection(projection); 

        var svg = d3.select("#%s").append("svg").attr("id", "%s") 
            .attr("width", width) 
            .attr("height", height); 

        queue() 
            .defer(d3.json, "/notebooks/Thesis/data/map/us.json") 
            .defer(d3.tsv, "%s", function(d) { rateById.set(d.id, +d.rate); }) 
            .await(ready); 

        function ready(error, us) { 
          if (error) throw error; 

          svg.append("g") 
              .attr("class", "counties") 
            .selectAll("path") 
              .data(topojson.feature(us, us.objects.counties).features) 
            .enter().append("path") 
              .attr("class", function(d) { return quantize(rateById.get(d.id)); }) 
              .attr("d", path); 

          svg.append("path") 
              .datum(topojson.mesh(us, us.objects.states, function(a, b) { return a !== b; })) 
              .attr("class", "states") 
              .attr("d", path); 
              
          var legend = d3.select('#%s-legend')
                          .append('ul')
                            .attr('class', 'list-inline');

            if(%s) {
    
                var keys = legend.selectAll('li.key')
                    .data(quantize.range());

                keys.enter().append('li')
                    .attr('class', function(d) { return d + " key"; })
                    .text(function(d) {
                        var r = quantize.invertExtent(d);
                        return r[0].toFixed(2) + " - " + r[1].toFixed(2);
                        //return Math.floor(r[0]) + " - " + Math.floor(r[1]);
                    });
            }
        } 

        d3.select(self.frameElement).style("height", height + "px"); 
    });
    """ % (domain_min, domain_max, map_id, svg_div_id, tsv_filename, map_id, show_legend))

TODO: save the SVG output to a PNG file so the maps will show on nbviewer.org.

Setting styles and placeholders for the maps:


In [58]:
%%html
<style>

.counties {
  fill: none;
}

.states {
  fill: none;
  stroke: #fff;
  stroke-linejoin: round;
}

#legend {
    padding: 1.5em 0 0 1.5em;
}

li.key {
    border-top-width: 15px;
    border-top-style: solid;
    font-size: .75em;
    width: 10%;
    padding-left: 0;
    padding-right: 0;
}

.q0-9 { fill:rgb(247,251,255); border-top-color:rgb(247,251,255); }
.q1-9 { fill:rgb(222,235,247); border-top-color:rgb(222,235,247); }
.q2-9 { fill:rgb(198,219,239); border-top-color:rgb(198,219,239); }
.q3-9 { fill:rgb(158,202,225); border-top-color:rgb(158,202,225); }
.q4-9 { fill:rgb(107,174,214); border-top-color:rgb(107,174,214); }
.q5-9 { fill:rgb(66,146,198); border-top-color:rgb(66,146,198); }
.q6-9 { fill:rgb(33,113,181); border-top-color:rgb(33,113,181); }
.q7-9 { fill:rgb(8,81,156); border-top-color:rgb(8,81,156); }
.q8-9 { fill:rgb(8,48,107); border-top-color:rgb(8,48,107); }
</style>



In [59]:
%%html
<div id="map1-legend"></div>
<div id='map1'></div>



In [60]:
maps = []
for idx, year in enumerate([2004, 2007, 2010, 2014]):
    subdf = final_model_values_df[final_model_values_df["year"] == year].reset_index().rename(columns={"fips": "id", "total_cost": "rate"})
    subdf["id"] = subdf["id"].apply(lambda x: int(x / 100000.0))
    subdf.to_csv("data/map/df_map_%s.tsv" % year, sep="\t", index=False)
    # max(df_map_2004["rate"]) # 36230.017534800005
    # min(df_map_2004["rate"]) # 19966.945344099997
    # np.any(df_map_2004["id"] == 2300307065)
    maps.append(print_map("/notebooks/Thesis/data/map/df_map_%s.tsv" % year, "map1", "map1_div%d" % idx, show_legend=bool(idx==0)))

In [61]:
maps[0]


Out[61]:

In [62]:
maps[1]


Out[62]:

In [63]:
maps[2]


Out[63]:

In [64]:
maps[3]


Out[64]:

Regional Averages of the Living Wage

(TOC)

Let's average the living wage across all the states within each region.


In [35]:
# final_model_values_df[ final_model_values_df['state'].isin(region_to_state_mapping[REGION_EAST]) ]
region_weighted_averages = { }

for region in region_to_state_mapping.keys():
    region_weighted_averages[region] = pd.concat([state_weighted_averages[state] for state in region_to_state_mapping[region]], axis=1) 

del region_weighted_averages['base']
# del region_weighted_averages['east']    

f = plt.figure(figsize=(16, 10))

for region in region_weighted_averages:
    region_weighted_averages[region].apply(np.average, axis=1).plot(label=region, linestyle='--', marker='o')

f.axes[0].set_title('Average Living Wage by Region')
f.axes[0].set_ylabel('Living Wage (Dollars / Year)')
f.axes[0].set_xlabel('Year')
l = f.axes[0].legend(loc="lower right")
plt.savefig("LaTeX/CurrentThesis/figures/regional_trend.png")
plt.show()


National Average Breakdown Across 2004 - 2014

(TOC)

Let's see how the model variables changed over the course of the decade from the perspective of a national average, weighted by population.
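
Concretely, for each year and each model variable the weighting used in the next cell is a population-weighted mean over counties, written in plain terms as:

weighted_average = sum( county_pop * county_value ) / sum( county_pop )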


In [61]:
fig, ax = plt.subplots(figsize=(16, 10))

index = np.arange(6)
bar_width = 0.07
y_min = 0

model_variable_columns = ['pop', 'food_cost', 'insurance_cost', 'healthcare_cost', 'housing_cost', 'transportation_cost', 'other_cost']

for idx, year in enumerate(model_years):
    data = final_model_values_df[ final_model_values_df['year'] == year ][model_variable_columns]
    
    weighted_data = []
    for column in model_variable_columns[1:]:
        weighted_data.append(sum(data['pop'] * data[column]) / sum(data['pop']))
    weighted_series = pd.Series(weighted_data, index=model_variable_columns[1:])

    rects = plt.bar(index + idx*bar_width, weighted_series, bar_width, alpha=0.5, color=husl_color_palette[0])
    for rect in rects:
        height = rect.get_height()
        ax.text(rect.get_x() + rect.get_width()/2., 1.007*height, '%d' % int(height), \
                ha='center', va='bottom', rotation=90, fontsize=12)
        ax.text(rect.get_x() + rect.get_width()/2., y_min + 100, '%d' % int(year), \
                ha='center', va='bottom', rotation=90, fontsize=10)
        
plt.xlabel('Model Variable')
plt.ylabel('Cost Per Year')
plt.title('Model Variable Comparison for National (Weighted) Averages: \n Each Variable from 2004 - 2014')
plt.xticks(index + 9*bar_width/2, weighted_series.index)
plt.tight_layout()
# plt.savefig("LaTeX/CurrentThesis/figures/national_avg_breakdown.png")
plt.show()


Living Wage Distribution in Most Populous Counties

(TOC)

This article from Business Insider lists the top 150 counties by population. Here I look at the distribution of living wage values separated by this grouping.

Some observations:

  • The populous-county distributions are more spread out (std: 3123.05) than the non-populous ones (std: 1791.76)
  • The populous-county distributions appear almost bimodal, with two distinct peaks, while the non-populous counties show a single peak.
    • In light of this, I separate the most populous counties into two subgroups in the subsequent visualizations: 'most1' contains the counties in the top 150 by population whose total cost falls in the bottom portion of the group, and 'most2' contains the top portion

TODO

  • What makes up this third large group

In [24]:
# Three subplots sharing both x/y axes
f, ((ax11, ax12), (ax21, ax22), (ax31, ax32), (ax41, ax42), (ax51, ax52), (ax61, ax62)) = \
    plt.subplots(6, 2, sharey=True, figsize=(16, 20))
axes = [ax11, ax21, ax31, ax41, ax51, ax61, ax12, ax22, ax32, ax42, ax52, ax62]

# Calculate the complement of the most populous counties (most_populous_fips is defined in the FMR section)
least_populous_fips = list(set(final_model_values_df.index.values) - set(most_populous_fips))

# Separate the data frame based on the FIPS codes of the populous and non-populous counties
most_populous_living_wage_df = final_model_values_df.loc[most_populous_fips]
least_populous_living_wage_df = final_model_values_df.loc[least_populous_fips]

# Color the histograms and density plots with same color
color_nonpopulous = current_color_palette[2]
color_populous = current_color_palette[0]
point_color = 'green'

# Keep track of variances of each distro
mpvariance = []
lpvariance = []

second_group_fips = []

# X-axis
x = numpy.arange(17000, 36000, 100)

split_threshold = [27500, 27500, 27500, 27500, 27500, 27500, 29000, 29500, 29500, 29500, 29500]

least_populous_maxes = []
most_populous_maxes = []
most_populous2_maxes = []

# setup_default_visuals()

plot_idx = 0
for year in model_years:
    # Subset data for populous counties by year, and store variance of distribution
    most_populous_living_wage_data = most_populous_living_wage_df[ most_populous_living_wage_df["year"] == year ]["total_cost"]
    mpvariance.append(np.std(most_populous_living_wage_data))
    # print len(most_populous_living_wage_data)
    
    # Subset data for non-populous counties by year, and store variance of distribution
    least_populous_living_wage_data = least_populous_living_wage_df[ least_populous_living_wage_df["year"] == year ]["total_cost"]
    lpvariance.append(np.std(least_populous_living_wage_data))
    # print len(least_populous_living_wage_data)
    
    # Histogram / density plot of populous counties living wage distribution
    axes[plot_idx].hist(least_populous_living_wage_data.values, 50, normed=True, label="Bottom 50%", alpha=0.5, color=color_populous)
    density = stats.kde.gaussian_kde(least_populous_living_wage_data)
    axes[plot_idx].plot(x, density(x), color=color_populous)
    
    # Least populous maxes
    least_populous_density = density(least_populous_living_wage_data.values)
    least_populous_max_idx = argmax(least_populous_density)
    axes[plot_idx].plot(least_populous_living_wage_data.values[least_populous_max_idx], 
                       least_populous_density[least_populous_max_idx], marker='o', color=point_color)
    least_populous_maxes.append(least_populous_living_wage_data.values[least_populous_max_idx])
    
    # Histogram / density plot of non-populous counties living wage distribution
    axes[plot_idx].hist(most_populous_living_wage_data.values, 30, normed=True, label="Upper 50%", alpha=0.5, color=color_nonpopulous)
    density = stats.kde.gaussian_kde(most_populous_living_wage_data.values)
    axes[plot_idx].plot(x, density(x), color=color_nonpopulous)
    
    # Most populous maxes
    first_group = most_populous_living_wage_data[most_populous_living_wage_data <= split_threshold[plot_idx]]
    first_group_density = density(first_group.values)
    most_populous_max_idx = argmax(first_group_density)
    axes[plot_idx].plot(first_group.values[most_populous_max_idx], 
                       first_group_density[most_populous_max_idx], marker='o', color=point_color)
    most_populous_maxes.append(first_group.values[most_populous_max_idx])
    
    second_group = most_populous_living_wage_data[most_populous_living_wage_data > split_threshold[plot_idx]]
    second_group_fips.append(set(second_group.index))
    second_group_density = density(second_group.values)
    most_populous2_max_idx = argmax(density(second_group.values))
    axes[plot_idx].plot(second_group.values[most_populous2_max_idx], 
                       second_group_density[most_populous2_max_idx], marker='o', color=point_color)
    most_populous2_maxes.append(second_group.values[most_populous2_max_idx])
    
    
    # Figure metadata and plot
    axes[plot_idx].legend(fontsize=12)
    axes[plot_idx].set_title("Populous vs Non-Populous Living Wage Distributions: %d" % year, fontsize=12, fontweight='bold')
    axes[plot_idx].set_xlim([17000, 36000])
    axes[plot_idx].tick_params(axis='both', which='major', labelsize=10)
    axes[plot_idx].tick_params(axis='both', which='minor', labelsize=10)

    # Go to next plot
    plot_idx += 1

f.subplots_adjust(wspace=0.05)
# plt.savefig("LaTeX/CurrentThesis/figures/populous_dist_and_trend.png")
plt.show()

setup_custom_visuals()

# Standard deviation
# print np.average(mpvariance) # 3123.05012618
# print np.average(lpvariance) # 1791.75923332



In [759]:
second_group_fips_set = set(second_group_fips[0])
for s in second_group_fips:
    second_group_fips_set = second_group_fips_set.union(s)

second_group_counties = []
for c in second_group_fips_set:
    second_group_counties.append(tuple(fmr_df.loc[2014].loc[c][["countyname", "state_alpha"]].values))
print "\n".join(["%s, %s" % c for c in sorted(second_group_counties, key=lambda x: x[1])])



Orange County, CA
San Mateo County, CA
San Diego County, CA
Alameda County, CA
Riverside County, CA
San Bernardino County, CA
San Francisco County, CA
Sonoma County, CA
Santa Clara County, CA
Ventura County, CA
Los Angeles County, CA
Contra Costa County, CA
Fairfield County, CT
New Haven County, CT
District of Columbia, DC
Palm Beach County, FL
Broward County, FL
Honolulu County, HI
Plymouth County, MA
Norfolk County, MA
Suffolk County, MA
Essex County, MA
Middlesex County, MA
Bristol County, MA
Worcester County, MA
Prince George's County, MD
Montgomery County, MD
Passaic County, NJ
Ocean County, NJ
Morris County, NJ
Bergen County, NJ
Hudson County, NJ
Monmouth County, NJ
Essex County, NJ
Middlesex County, NJ
Union County, NJ
Queens County, NY
Suffolk County, NY
Bronx County, NY
New York County, NY
Richmond County, NY
Westchester County, NY
Nassau County, NY
Kings County, NY
Fairfax County, VA

Let's look at the living wage peaks of each subgroup over time:


In [25]:
f = plt.figure(figsize=(16, 10))
plt.plot(model_years, least_populous_maxes, label="Bottom 50%", linestyle='--', marker='o')
plt.plot(model_years, most_populous_maxes, label="Upper 50% - Bottom", linestyle='--', marker='o')
plt.plot(model_years, most_populous2_maxes, label="Upper 50% - Top", linestyle='--', marker='o')
f.axes[0].set_title('Living Wage Peaks by Population Group')
f.axes[0].set_ylabel('Living Wage (Dollars / Year)')
f.axes[0].set_xlabel('Year')
l = f.axes[0].legend(loc="lower right")
plt.savefig("LaTeX/CurrentThesis/figures/peaks_by_populous.png")
plt.show()


What's interesting here is that the difference between the peak of the least-populous county group and each of the two peaks in the most-populous group stays fairly steady in value, even as all three peaks rise over time. This suggests that living in a more populous area comes at the cost of a higher living wage, and that this premium is maintained over time.
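
To put a number on this, the year-by-year gaps between the subgroup peaks can be tabulated directly from the lists computed above (a minimal sketch reusing most_populous_maxes, most_populous2_maxes and least_populous_maxes from the previous cell; not one of the original cells):

import numpy as np
import pandas as pd

# Difference between each most-populous subgroup peak and the least-populous peak, per year
peak_gaps = pd.DataFrame({
    'year': model_years,
    'upper_top_minus_bottom50': np.array(most_populous2_maxes) - np.array(least_populous_maxes),
    'upper_bottom_minus_bottom50': np.array(most_populous_maxes) - np.array(least_populous_maxes),
})
print peak_gaps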

Which variables in the model account for the differences in the living wage between the least and most populous counties?


In [62]:
columns = ["food_cost", "insurance_cost", "healthcare_cost", "housing_cost", "transportation_cost", "other_cost", "total_cost"]

# Find the average model variables for least populous counties
least_populous_model_variable_averages = \
    final_model_values_df.loc[least_populous_fips].groupby("year").aggregate(np.average)
least_populous_model_variable_averages = least_populous_model_variable_averages[columns]

# For ease of coding, get subset of dataframe for the most populous counties
most_pop_df = final_model_values_df.loc[most_populous_fips]

# For each year, average model variables for each subgroup in the most populous counties
most_populous_model_variable_averages = None
most_populous2_model_variable_averages = None
for idx, year in enumerate(model_years):
    # Subset the most populous counties at or below this year's split threshold and append to the running dataframe
    if most_populous_model_variable_averages is not None:
        most_populous_model_variable_averages = pd.concat([most_populous_model_variable_averages, most_pop_df[ most_pop_df["total_cost"] <= split_threshold[idx] ].loc[most_populous_fips]])
    else:
        most_populous_model_variable_averages = most_pop_df[ most_pop_df["total_cost"] <= split_threshold[idx] ].loc[most_populous_fips]

        
    # Subset the most populous counties above this year's split threshold and append to the running dataframe
    if most_populous2_model_variable_averages is not None:
        most_populous2_model_variable_averages = pd.concat([most_populous2_model_variable_averages, most_pop_df[ most_pop_df["total_cost"] > split_threshold[idx] ].loc[most_populous_fips]])
    else:
        most_populous2_model_variable_averages = most_pop_df[ most_pop_df["total_cost"] > split_threshold[idx] ].loc[most_populous_fips]

# Aggregate by year and take an average; keep only the useful variables (tax rates hardly change)
most_populous_model_variable_averages = most_populous_model_variable_averages.groupby("year").aggregate(np.average)
most_populous_model_variable_averages = most_populous_model_variable_averages[columns]
most_populous2_model_variable_averages= most_populous2_model_variable_averages.groupby("year").aggregate(np.average)
most_populous2_model_variable_averages = most_populous2_model_variable_averages[columns]

# Differences between the model variables between each subgroup
model_variable_diffs = pd.concat([(most_populous2_model_variable_averages - most_populous_model_variable_averages).apply(np.average).transpose(),
                                (most_populous2_model_variable_averages - least_populous_model_variable_averages).apply(np.average).transpose(),
                                (most_populous_model_variable_averages - least_populous_model_variable_averages).apply(np.average).transpose()], axis=1)

# Print nicely
model_variable_diffs.columns = ['Diff "Upper 50%,Top" and "Upper 50%,Bottom"', 'Diff "Upper 50%,Top" and Least Pop', 'Diff "Upper 50%,Bottom" and Least Pop']
formatters = { column:'${:,.2f}'.format for column in model_variable_diffs.columns }
HTML(model_variable_diffs.to_html(formatters=formatters))


Out[62]:
Diff "Upper 50%,Top" and "Upper 50%,Bottom" Diff "Upper 50%,Top" and Least Pop Diff "Upper 50%,Bottom" and Least Pop
food_cost $197.52 $306.93 $109.41
insurance_cost $48.47 $67.46 $18.99
healthcare_cost $-0.00 $0.00 $0.00
housing_cost $5,190.87 $7,668.49 $2,477.62
transportation_cost $-0.00 $0.00 $0.00
other_cost $0.00 $0.00 $-0.00
total_cost $6,119.75 $9,053.42 $2,933.67

As we can see, the majority of the average difference between these sub-groups comes from the housing cost variable, followed by food cost and then insurance cost. Note: since we are not doing regional weighting on the CEX data, the three variables derived from it are the same nationally, so we expect no difference in those variables between groups. This should be revisited once an appropriate regional weighting scheme is found.

Another avenue for investigation is to look into each sub-group and see what percentage each individual model variable accounts for in that sub-group's average total cost. We will only look at 2014:


In [65]:
# Each model variable as a percentage of the summed cost components, broken down by population group
most_populous2_model_variable_perc = most_populous2_model_variable_averages.apply(lambda x: x * 100.0 / sum(x[:-1]), axis=1)
most_populous_model_variable_perc = most_populous_model_variable_averages.apply(lambda x: x * 100.0 / sum(x[:-1]), axis=1)
least_populous_model_variable_perc = least_populous_model_variable_averages.apply(lambda x: x * 100.0 / sum(x[:-1]), axis=1)

# Differences between the model variables between each subgroup for 2014
model_variable_percs = pd.concat([most_populous2_model_variable_perc.loc[model_years[-1]],
                                most_populous_model_variable_perc.loc[model_years[-1]],
                                least_populous_model_variable_perc.loc[model_years[-1]]], axis=1)

# No need for total_cost anymore
model_variable_percs = model_variable_percs[:-1]

model_variable_percs.columns = ['Upper 50%,Top', 'Upper 50%,Bottom', 'Bottom 50%']

# Create a 1 x 3 figure of stacked bar
fig = plt.figure(figsize=(20, 10))
ax = fig.gca()

width = 0.1
idx = [0, width, 2*width]
xtick_idx = [width/2, 3*width/2, 5*width/2]

data = None
prev_data = None
for rowidx in range(len(model_variable_percs)):
    color = current_color_palette[rowidx]
    data = model_variable_percs.iloc[rowidx]
    if prev_data is not None:
        ax.bar(idx, data, width, bottom=prev_data, color=color, alpha=0.5)
        prev_data += data
    else:
        ax.bar(idx, data, width, color=color, alpha=0.5)
        prev_data = data

# Add legend,title and Adjust x-axis labels and limits
ax.legend(labels=model_variable_percs.index, prop={'size':16})
ax.set_title("Model Variable Percentage of Total by Population Group (2014)")
ax.set_ylabel("Percent (Of Total Living Wage)")
plt.xticks(xtick_idx, model_variable_percs.columns, fontsize=18)
ax.set_xlim(0, 4*width)
ax.set_ylim((0, 100))
# plt.savefig("LaTeX/CurrentThesis/figures/variable_percentage_populous.png")
plt.show()


Living Wage Distribution by Race

(TOC)

TODO

  • Why do 2004 and 2005 not work in this breakdown?
  • Right now using only 2010 census data; can I get it for 2000?
  • What % of race has living wage above some threshold

Loading Race Data from 2010 Census


In [31]:
census_race_data_df = pd.DataFrame.from_csv(open("/home/james/Code/Masters/Thesis/data/census/aff_download/DEC_10_DP_DPDP1_with_ann_stripped.csv", "r"))
census_race_columns = ["GEO.display-label", "HD02_S078", "HD02_S079", "HD02_S080", "HD02_S081", 
                       "HD02_S089", 
                       "HD02_S114","HD02_S115","HD02_S116","HD02_S117","HD02_S118","HD02_S119"]
census_race_data_df = census_race_data_df[census_race_columns]
census_race_data_df.columns = ["Label", "white_perc", "black_perc", "american_indian_perc", "asian_perc", 
                               "pacific_islander_perc", 
                               "hispanic_perc", "hispanic_white_perc", "hispanic_black_perc", "hispanic_american_indian_perc", 
                               "hispanic_asian_perc", "hispanic_pacific_islander_perc"]

census_races = census_race_data_df.columns[1:6]
                                   
# Adjust each race percentage to exclude Hispanics (leaving non-Hispanic percentages)
for race in census_races:
    census_race_data_df[race] = census_race_data_df[race] - census_race_data_df["hispanic_" + race]

census_races = [ "_".join(race.split('_')[0:-1]) for race in census_race_data_df.columns[1:7] ]
num_races = len(census_races)

census_race_data_df.index = census_race_data_df.index.map(lambda x: int(x.split("US")[1] + '99999'))
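
The last line above rewrites the AFF GEO ids into the FIPS convention used throughout this notebook (the 5-digit county FIPS followed by '99999'). An illustrative example, assuming the standard AFF GEO.id format:

int("0500000US01001".split("US")[1] + '99999')   # -> 100199999 (Autauga County, AL)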

Living Wage County Non-Weighted Breakdown by Race

At the end of the day, I think not weighting the data by population doesn't make sense. Leaving this section here until I can fully create a weighted version of the boxplot / violin plot in matplotlib.


In [34]:
race_years = range(2006, 2015)
num_race_years = len(race_years)
race_density = defaultdict(dict)
race_counties = defaultdict(dict)

thresholds = [ 35.0, 35.0, 20.0, 20.0, 5.0, 5.0, 35.0 ]

setup_default_visuals()

f, axes = plt.subplots(len(race_years), 1, sharey=True, figsize=(16, 30))
alpha = 0.2
plot_idx = 0

for year in race_years:
    boxplot_data = []
    for idx, race in enumerate(census_races):
        subdf = final_model_values_df[final_model_values_df['year'] == year].join(census_race_data_df).dropna()
        race_counties[race] = subdf[census_race_data_df.columns[1:]].apply(lambda x: x[race + "_perc"] > thresholds[idx], axis=1)
        boxplot_data.append(subdf[race_counties[race]]["total_cost"])
        race_density[race][year] = stats.kde.gaussian_kde(subdf[race_counties[race]]["total_cost"])

    axes[plot_idx].boxplot(boxplot_data)
    axes[plot_idx].set_title("Living Wage Distribution by County for Race By Year: %s" % year)
    axes[plot_idx].set_ylim((18000, 35000))
    axes[plot_idx].set_xticklabels(census_races)
    plot_idx += 1

# plt.savefig("LaTeX/CurrentThesis/figures/distribution_race_boxplot.png")
plt.show()



In [26]:
race_years = range(2006, 2015)
num_race_years = len(race_years)
race_density = defaultdict(dict)
race_counties = defaultdict(dict)

thresholds = [ 35.0, 35.0, 20.0, 20.0, 5.0, 5.0, 35.0 ]

f, axes = plt.subplots(len(census_races), 1, sharey=True, figsize=(16, 30))
alpha = 0.2
plot_idx = 0

for idx, race in enumerate(census_races):
    boxplot_data = []
    for year in race_years:
        subdf = final_model_values_df[final_model_values_df['year'] == year].join(census_race_data_df).dropna()
        race_counties[race] = subdf[census_race_data_df.columns[1:]].apply(lambda x: x[race + "_perc"] > thresholds[idx], axis=1)
        boxplot_data.append(subdf[race_counties[race]]["total_cost"])
        race_density[race][year] = stats.kde.gaussian_kde(subdf[race_counties[race]]["total_cost"])

    axes[plot_idx].boxplot(boxplot_data)
    axes[plot_idx].set_title("Living Wage Distribution by County for Race: %s" % race)
    axes[plot_idx].set_ylim((18000, 35000))
    axes[plot_idx].set_xticklabels(race_years)
    plot_idx += 1

# plt.savefig("LaTeX/CurrentThesis/figures/distribution_race_boxplot2.png")
plt.show()



In [29]:
f, ((ax11, ax12), (ax21, ax22), (ax31, ax32), (ax41, ax42), (ax51, ax52), (ax61, ax62), (ax71, ax72), (ax81, ax82), (ax91, ax92)) = \
    plt.subplots(num_race_years, 2, sharey=True, figsize=(16, 20))
axes = [ax11, ax21, ax31, ax41, ax51, ax61, ax71, ax81, ax91, ax12, ax22, ax32, ax42, ax52, ax62, ax72, ax82, ax92]

alpha = 0.2
x = numpy.arange(17000, 36000, 100)
plot_idx = 0

races_group1 = ["white", "black", "american_indian", "hispanic"]
races_group2 = ["asian", "pacific_islander"]

for year in race_years:
    for race in races_group1:
        axes[plot_idx].fill(x, race_density[race][year](x), alpha=alpha)#, color=current_color_palette[0])
        # axes[plot_idx].fill(x, race_density["white"][year](x), alpha=alpha)#, color=current_color_palette[0])
        # axes[plot_idx].fill(x, race_density["black"][year](x), alpha=alpha)#, color=current_color_palette[0])
        # axes[plot_idx].fill(x, race_density["american_indian"][year](x), alpha=alpha)#, color=current_color_palette[0])
        # axes[plot_idx].fill(x, race_density["hispanic"][year](x), alpha=alpha)#, color=current_color_palette[0])    
    
    axes[plot_idx].set_title(year,  fontsize=12)
    axes[plot_idx].legend(labels=races_group1, fontsize=12)
    axes[plot_idx].set_xlim((17000, 31000))
    axes[plot_idx].set_ylim((0, 0.00042))
    axes[plot_idx].set_ylabel("density", fontsize=10)
    axes[plot_idx].set_xticks(range(17000, 36000, 2000))
    
    for race in races_group2:
        axes[plot_idx + num_race_years].fill(x, race_density[race][year](x), alpha=alpha)#, color=current_color_palette[0])
        # axes[plot_idx + num_race_years].fill(x, race_density["asian"][year](x), alpha=alpha)#, color=current_color_palette[0])
        # axes[plot_idx + num_race_years].fill(x, race_density["pacific_islander"][year](x), alpha=alpha)#, color=current_color_palette[0])
    
    axes[plot_idx + num_race_years].set_title(year,  fontsize=12)
    axes[plot_idx + num_race_years].legend(labels=races_group2, fontsize=12)
    axes[plot_idx + num_race_years].set_xlim((17000, 36000))
    axes[plot_idx + num_race_years].set_ylim((0, 0.00042))
    axes[plot_idx + num_race_years].set_xticks(range(17000, 36000, 2000))
    
    # Go to next plot
    plot_idx += 1

f.tight_layout()
f.subplots_adjust(wspace=0.075)

setup_default_visuals()
# plt.savefig("LaTeX/CurrentThesis/figures/distribution_race_density.png")

setup_custom_visuals()


Living Wage County Weighted Breakdown by Race


In [45]:
# Find race-population-weighted kernel density estimates of the living wage
weighted_costs = defaultdict(dict)
# for year in race_years:
for year in model_years:
    subdf = final_model_values_df[final_model_values_df['year'] == year].join(census_race_data_df).dropna()

    for race in census_races:
        subdf[race + '_pop'] = (subdf['pop'] * (subdf[race + '_perc'] / 100.0)).map(int)
        
        if year not in weighted_costs or race not in weighted_costs[year]:
            weighted_cost = sm.nonparametric.KDEUnivariate(subdf["total_cost"])
            weighted_cost.fit(fft=False, weights=subdf[race + '_pop'])
            weighted_costs[year][race] = weighted_cost

Custom weighted violin plots (saved as public gist)


In [69]:
# import weighted
# from matplotlib.cbook import violin_stats
# from scipy import stats
# import statsmodels.api as sm

def vdensity_with_weights(weights):
    def vdensity(data, coords):
        # print "2w: %s" % weights
        # print "2d: %s" % data
        weighted_cost = sm.nonparametric.KDEUnivariate(data)
        weighted_cost.fit(fft=False, weights=weights)

        y = weighted_cost.evaluate(coords)
        # print "y: %s" % y
        return y
    return vdensity

from matplotlib.cbook import violin_stats

def custom_violin_stats(data, weights):
    # print "1w: %s" % weights
    # print "1d: %s" % data
    median = weighted.quantile_1D(data, weights, 0.5)
    mean, sumw = np.ma.average(data, weights=list(weights), returned=True)
    results = violin_stats(data, vdensity_with_weights(weights))
    
    # print "orig results: %s" % results[0]["vals"]
    
    results[0][u"mean"] = mean
    results[0][u"median"] = median
    results[0][u"min"] =  np.min(data)
    results[0][u"max"] =  np.max(data)

    # print "new results: %s" % results[0]["vals"]
    return results

In [75]:
# fig, axes = plt.subplots(nrows=6, ncols=2, figsize=(16, 10))
# f, (ax11, ax21, ax31, ax41, ax51, ax61, ax71, ax81, ax91) = \
f, ((ax11, ax12), (ax21, ax22), (ax31, ax32), (ax41, ax42), (ax51, ax52), (ax61, ax62)) = \
    plt.subplots(int(math.ceil(len(model_years)/2))+1, 2, sharey=True, figsize=(16, 40))
axes = [ax11, ax21, ax31, ax41, ax51, ax61, ax12, ax22, ax32, ax42, ax52, ax62]

race_labels = ["White", "Black", "American Indian", "Asian", "Pacific Islander", "Hispanic"]

plot_idx = 0

# for year in race_years:
for year in model_years:
    subdf = final_model_values_df[final_model_values_df['year'] == year].join(census_race_data_df).dropna()

    pos_idx = 0
    for race in census_races:
        subdf[race + '_pop'] = (subdf['pop'] * (subdf[race + '_perc'] / 100.0)).map(float)
        
        vpstats1 = custom_violin_stats(np.asarray(subdf['total_cost']), np.asarray(subdf[race + '_pop']))
        tmp = axes[plot_idx].violin(vpstats1, [pos_idx+1], vert=False, 
                              showmeans=True, showextrema=True, showmedians=True)

        for pc in tmp['bodies']:
            pc.set_facecolor(current_color_palette[pos_idx])
            pc.set_edgecolor('black')
        
        pos_idx += 1
    
    axes[plot_idx].set_xlim((17000, 37000))
    axes[plot_idx].set_ylim((0.5, len(census_races)+1))
    axes[plot_idx].set_title("Year: %s" % year, fontsize=30, fontweight='bold')
    axes[plot_idx].set_yticklabels(labels=[''] + race_labels)
    plot_idx += 1
    

f.tight_layout()
f.subplots_adjust(hspace=0.2, wspace=0.075)
plt.savefig("LaTeX/CurrentThesis/figures/weighted_distribution_race_violin.png")
plt.show()



In [74]:
f, (ax11, ax21, ax31, ax41, ax51, ax61) = \
    plt.subplots(num_races, sharey=True, figsize=(16, 40))
axes = [ax11, ax21, ax31, ax41, ax51, ax61]

plot_idx = 0

for idx, race in enumerate(census_races):
    pos_idx = 0
    #for year in race_years:
    for year in model_years:
        subdf = final_model_values_df[final_model_values_df['year'] == year].join(census_race_data_df).dropna()
        subdf[race + '_pop'] = (subdf['pop'] * (subdf[race + '_perc'] / 100.0)).map(float)
        
        vpstats1 = custom_violin_stats(np.asarray(subdf['total_cost']), np.asarray(subdf[race + '_pop']))
        tmp = axes[plot_idx].violin(vpstats1, [pos_idx], vert=False, 
                              showmeans=True, showextrema=True, showmedians=True)

        for pc in tmp['bodies']:
            pc.set_facecolor(current_color_palette[plot_idx])
            pc.set_edgecolor('black')
        
        pos_idx += 1
    
    axes[plot_idx].set_xlim((18000, 37000))
    axes[plot_idx].set_ylim((-0.5, len(model_years)))
    axes[plot_idx].set_title("Race: %s" % race_labels[idx], fontsize=30, fontweight='bold')
    axes[plot_idx].set_yticklabels([''] + [year for year in model_years if year % 2 == 0])
    plot_idx += 1
    
f.tight_layout()
f.subplots_adjust(hspace=0.2, wspace=0.075)
# plt.savefig("LaTeX/CurrentThesis/figures/weighted_distribution_race_violin2.png")
plt.show()



In [48]:
f, ((ax11, ax12), (ax21, ax22), (ax31, ax32), (ax41, ax42), (ax51, ax52), (ax61, ax62), (ax71, ax72), (ax81, ax82), (ax91, ax92)) = \
    plt.subplots(num_race_years, 2, sharey=True, figsize=(16, 20))
axes = [ax11, ax21, ax31, ax41, ax51, ax61, ax71, ax81, ax91, ax12, ax22, ax32, ax42, ax52, ax62, ax72, ax82, ax92]

plot_idx = 0
for year in race_years:
    for race in races_group1:
        axes[plot_idx].fill(weighted_costs[year][race].support, weighted_costs[year][race].density, alpha=alpha)
        
    axes[plot_idx].set_title(year,  fontsize=12)
    axes[plot_idx].legend(races_group1, fontsize=12, loc="upper right")
    axes[plot_idx].set_xlim((18500, 37000))
    # axes[plot_idx].set_ylim((0.00, 0.10))
    
    for race in races_group2:
        axes[plot_idx + num_race_years].fill(weighted_costs[year][race].support, weighted_costs[year][race].density, alpha=alpha)
        
    axes[plot_idx + num_race_years].set_title(year,  fontsize=12)
    axes[plot_idx + num_race_years].legend(races_group2, fontsize=12, loc="upper left")
    axes[plot_idx + num_race_years].set_xlim((18500, 37000))
    # axes[plot_idx].set_ylim((0.00, 0.10))
    
    # Go to next plot
    plot_idx += 1

f.tight_layout()
plt.subplots_adjust(wspace=0.075)
plt.savefig("LaTeX/CurrentThesis/figures/weighted_distribution_race_density.png")
setup_custom_visuals()


Population Weighted Averages Broken Down By Race


In [67]:
fig, ax = plt.subplots(figsize=(16, 10))
index = np.arange(len(census_races))
bar_width = 0.07
width = 0.1
y_min = 10000

for idx, year in enumerate(model_years):
    race_points = []
    subdf = final_model_values_df[final_model_values_df['year'] == year].join(census_race_data_df).dropna()
    for race in census_races:
        subdf[race + '_pop'] = subdf["pop"] * (subdf[race + '_perc'] / 100.0)
        race_points.append(sum(subdf[race + '_pop'] * subdf['total_cost']) / sum(subdf[race + '_pop']))
    rects = plt.bar(index + idx*bar_width, race_points, bar_width, color=husl_color_palette[0], alpha=0.25)
    for rect in rects:
        height = rect.get_height()
        ax.text(rect.get_x() + rect.get_width()/2., 1.005*height, '%d' % int(height), \
                ha='center', va='bottom', rotation=90, fontsize=12)
        ax.text(rect.get_x() + rect.get_width()/2., y_min + 100, '%d' % int(year), \
                ha='center', va='bottom', rotation=90, fontsize=10)

plt.xlabel('Race')
plt.ylabel('Average Living Wage')
plt.title('Average Living Wage By Race: \n Each Race from 2004 - 2014')
plt.xticks(index + 9*width/2, census_races)
plt.xlim((0, 6))
plt.ylim((y_min, 30000))
plt.tight_layout()
# plt.savefig("LaTeX/CurrentThesis/figures/average_by_race.png")
plt.show()


Living Wage Gap

(TOC)

The living wage gap can be defined as:

  • the difference between the living wage and the applicable minimum wage.
  • the difference between the living wage and the applicable median income.

For the median wage, two data sets are available from Census.gov:

American Community Survey (ACS): Starting with 2000, the ACS provides subnational estimates of income and poverty for the nation, states, and places, counties, and metropolitan areas with a population of at least 250,000. The sample size of this survey from 2000 to 2004 was about 800,000 addresses per year. In 2006, the ACS began releasing annual subnational estimates of income and poverty for all places, counties, and metropolitan areas with a population of at least 65,000 as well as the nation and the states. The sample size of this survey increased to about three million addresses per year, making the ACS exceptionally useful for subnational analyses.

Small Area Income and Poverty Estimates (SAIPE): The SAIPE program produces single-year estimates of median household income and poverty for states and all counties, as well as population and poverty estimates for school districts. Since SAIPE estimates combine ACS data with administrative and other data, SAIPE estimates generally have lower variance than ACS estimates but are released later because they incorporate ACS data in the models. For counties and school districts, particularly those with populations below 65,000, the SAIPE program provides the most accurate subnational estimates of poverty. For counties, SAIPE generally provides the best single-year estimates of median household income.

Based on the description above, I will choose SAIPE data for estimates of county median income.

For the minimum wage:

Basic Minimum Wages: CHANGES IN BASIC MINIMUM WAGES IN NON-FARM EMPLOYMENT UNDER STATE LAW: SELECTED YEARS 1968 TO 2013

We calculate the gap for each FIPS and year combination and look for any trends.
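
A minimal sketch of that calculation (illustrative only; it assumes a combined dataframe like the wages_df assembled at the end of this section, with total_cost, minimum_wage and median_income columns per county FIPS and year, and the gap signed so that a positive value means the living wage exceeds the reference wage):

# Hypothetical gap columns, added once the combined dataframe exists
wages_df['minimum_wage_gap'] = wages_df['total_cost'] - wages_df['minimum_wage']
wages_df['median_income_gap'] = wages_df['total_cost'] - wages_df['median_income']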

TODO

  • start with median wage living wage gap
    • load median wage data (DONE)
    • validate the data: what to do about empty cells?
    • for each county in each year, calculate the gap
    • calculate state averages
    • look at state averages
    • calculate regional averages
    • look at regional averages
  • start with minimum wage living wage gap
    • load minimum wage data (DONE)
    • validate the data: how do you fill in states with no minimum wage? (DONE)
    • for each county in each year, calculate the gap
    • calculate state averages
    • look at state averages
    • calculate regional averages
    • look at regional averages
  • Using PINC survey data, find out what % of population live at or below living wage
  • what % of people make the living wage per year? Can this even be done? It would need more info on the underlying distribution of wages beyond the mean; PINC data?

Small Area Income and Poverty Estimates

First, we'll download the data for 2004 - 2014 and load it into a dataframe:


In [235]:
saipe_county_data = {}
saipe_state_data = {}
saipe_national_data = {}

saipe_path = "data/census/SAIPE"
saipe_columns=["state", "county", "state_alpha", "countyname", "median_income"]

for year in model_years:
# for year in [2006]:
    filename = os.path.join(saipe_path, "est%sALL_stripped.csv" % str(year)[2:])
    df = pd.DataFrame.from_csv(filename, header=None, index_col=False)
    df.columns = saipe_columns

    # Skip counties with no median income
    if year == model_years[0]:
        print "Dropped: %d" % np.argmax(df["median_income"] == ".")
        df = df[ df["median_income"] != "." ]
    else:
        startlen = len(df)
        df = df.dropna()
        print "Dropped: %d" % (startlen - len(df))
        
    # Cleanup median income
    try:
        if df["median_income"].dtype == np.object:
            df["median_income"] = df["median_income"].apply(lambda x: float(x.replace(",", "")))
    except Exception as e:
        pass
    
    # Filter for only counties
    saipe_county_data[year] = df[ df["county"] != 0 ].copy()
    
    # Calculate FIPS for each county
    try:
        saipe_county_data[year]["fips"] = saipe_county_data[year].apply(
            lambda row: pad_fips(str(row["state"]) + pad_county(int(row["county"]))), axis=1)
    except Exception as e:
        print df[df["county"].isnull()]
    
    # Filter out the states
    saipe_state_data[year] = df[ (df["county"] == 0) & (df["countyname"] != "United States") ].copy()
    
    # Filter out national average
    saipe_national_data[year] = df[ (df["county"] == 0) & (df["countyname"] == "United States") ].copy()
    
    print "Num counties: %d" % len(saipe_county_data[year])


Dropped: 0
Num counties: 3141
Dropped: 0
Num counties: 3141
Dropped: 0
Num counties: 3141
Dropped: 1
Num counties: 3140
Dropped: 1
Num counties: 3141
Dropped: 1
Num counties: 3142
Dropped: 0
Num counties: 3143
Dropped: 0
Num counties: 3143
Dropped: 0
Num counties: 3143
Dropped: 0
Num counties: 3143
Dropped: 0
Num counties: 3142

Minimum Wage Data

Here I load the minimum wage data from a hand-written CSV file that was generated from the HTML at the minimum wage data source. For any state with no minimum wage, or a minimum wage less than the Federal minimum wage, we use the Federal minimum wage. To quote:

If a state minimum wage rate is less than the federal rate of \$7.25 per hour then employers are required to follow federal minimum wage law. State laws will only supersede federal requirements when the state law provides a more generous benefit to employees.
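
For reference on the annualization used in the next cell: hourly rates are converted to yearly amounts assuming a 40-hour week for 52 weeks, so the federal $7.25/hour becomes 7.25 × 40 × 52 = $15,080/year, which matches the Federal row in the table below.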


In [66]:
# Load minimum wage data gathered by hand from data source
minimum_wage_df = pd.DataFrame.from_csv("data/census/MinimumWage/minimum_wage.csv")

def convert_to_yearly(x):
    if x != -1: 
        return x * 52.0 * 40.0
    else: 
        return -1.0

# Convert to yearly wage
minimum_wage_df = minimum_wage_df.applymap(convert_to_yearly)
minimum_wage_df.columns = [ int(year) for year in minimum_wage_df.columns]
minimum_wage_df = minimum_wage_df[model_years]

# Fill in missing minimum wages with federal minimum wage
for year in model_years:
    minimum_wage_df[year][ minimum_wage_df[year] < minimum_wage_df[year]["Federal"] ] = minimum_wage_df[year]["Federal"]
    
# Convert statenames to state abbreviations for joining later
# minimum_wage_df = minimum_wage_df.rename(index={"District of Columbia": "DC"})
minimum_wage_df.index = map(lambda x: statename_to_state_mapping[x.lower()], minimum_wage_df.index)

Combine Wage Data with Model Values


In [67]:
minimum_wage_df


Out[67]:
2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
Federal 10712.0 10712 10712.0 10712.0 12168.0 13624.0 15080.0 15080.0 15080.0 15080.0 15080.0
AL 10712.0 10712 10712.0 10712.0 12168.0 13624.0 15080.0 15080.0 15080.0 15080.0 15080.0
AK 14872.0 14872 14872.0 14872.0 14872.0 14872.0 16120.0 16120.0 16120.0 16120.0 16120.0
AZ 10712.0 10712 10712.0 14040.0 14352.0 15080.0 15080.0 15288.0 15912.0 16224.0 16432.0
AR 10712.0 10712 10712.0 13000.0 13000.0 13624.0 15080.0 15080.0 15080.0 15080.0 15080.0
CA 14040.0 14040 14040.0 15600.0 16640.0 16640.0 16640.0 16640.0 16640.0 16640.0 18720.0
CO 10712.0 10712 10712.0 14248.0 14601.6 15142.4 15080.0 15308.8 15891.2 16182.4 16640.0
CT 14768.0 14768 15392.0 15912.0 15912.0 16640.0 17160.0 17160.0 17160.0 17160.0 18096.0
DE 12792.0 12792 12792.0 13832.0 14872.0 14872.0 15080.0 15080.0 15080.0 15080.0 16120.0
FL 10712.0 10712 13312.0 13873.6 14123.2 14996.8 15080.0 15080.0 15953.6 16203.2 16494.4
GA 10712.0 10712 10712.0 10712.0 12168.0 13624.0 15080.0 15080.0 15080.0 15080.0 15080.0
HI 13000.0 13000 14040.0 15080.0 15080.0 15080.0 15080.0 15080.0 15080.0 15080.0 15080.0
ID 10712.0 10712 10712.0 10712.0 12168.0 13624.0 15080.0 15080.0 15080.0 15080.0 15080.0
IL 11440.0 13520 13520.0 13520.0 15600.0 16120.0 16640.0 17160.0 17160.0 17160.0 17160.0
IN 10712.0 10712 10712.0 10712.0 12168.0 13624.0 15080.0 15080.0 15080.0 15080.0 15080.0
IA 10712.0 10712 10712.0 10712.0 15080.0 15080.0 15080.0 15080.0 15080.0 15080.0 15080.0
KS 10712.0 10712 10712.0 10712.0 12168.0 13624.0 15080.0 15080.0 15080.0 15080.0 15080.0
KY 10712.0 10712 10712.0 10712.0 12168.0 13624.0 15080.0 15080.0 15080.0 15080.0 15080.0
LA 10712.0 10712 10712.0 10712.0 12168.0 13624.0 15080.0 15080.0 15080.0 15080.0 15080.0
ME 13000.0 13208 13520.0 14040.0 14560.0 15080.0 15600.0 15600.0 15600.0 15600.0 15600.0
MD 10712.0 10712 10712.0 12792.0 12792.0 13624.0 15080.0 15080.0 15080.0 15080.0 15080.0
MA 14040.0 14040 14040.0 15600.0 16640.0 16640.0 16640.0 16640.0 16640.0 16640.0 16640.0
MI 10712.0 10712 10712.0 14456.0 14872.0 15392.0 15392.0 15392.0 15392.0 15392.0 16952.0
MN 10712.0 10712 10920.0 10920.0 12168.0 13624.0 15080.0 15080.0 15080.0 15080.0 15080.0
MS 10712.0 10712 10712.0 10712.0 12168.0 13624.0 15080.0 15080.0 15080.0 15080.0 15080.0
MO 10712.0 10712 10712.0 13520.0 13832.0 14664.0 15080.0 15080.0 15080.0 15288.0 15600.0
MT 10712.0 10712 10712.0 10712.0 12168.0 13624.0 15080.0 15080.0 15080.0 15080.0 16432.0
NE 10712.0 10712 10712.0 10712.0 12168.0 13624.0 15080.0 15080.0 15080.0 15080.0 15080.0
NV 10712.0 10712 10712.0 12792.0 13166.4 13624.0 15080.0 15080.0 15080.0 15080.0 15080.0
NH 10712.0 10712 10712.0 10712.0 13520.0 15080.0 15080.0 15080.0 15080.0 15080.0 15080.0
NJ 10712.0 10712 12792.0 14872.0 14872.0 14872.0 15080.0 15080.0 15080.0 15080.0 17160.0
NM 10712.0 10712 10712.0 10712.0 13520.0 15600.0 15600.0 15600.0 15600.0 15600.0 15600.0
NY 10712.0 12480 14040.0 14872.0 14872.0 14872.0 15080.0 15080.0 15080.0 15080.0 16640.0
NC 10712.0 10712 10712.0 12792.0 12792.0 13624.0 15080.0 15080.0 15080.0 15080.0 15080.0
ND 10712.0 10712 10712.0 10712.0 12168.0 13624.0 15080.0 15080.0 15080.0 15080.0 15080.0
OH 10712.0 10712 10712.0 14248.0 14560.0 15184.0 15184.0 15392.0 16016.0 16328.0 15080.0
OK 10712.0 10712 10712.0 10712.0 12168.0 13624.0 15080.0 15080.0 15080.0 15080.0 15080.0
OR 14664.0 15080 15600.0 16224.0 16536.0 17472.0 17472.0 17680.0 18304.0 18616.0 18928.0
PA 10712.0 10712 10712.0 13000.0 14872.0 14872.0 15080.0 15080.0 15080.0 15080.0 15080.0
RI 14040.0 14040 14040.0 15392.0 15392.0 15392.0 15392.0 15392.0 15392.0 16120.0 16640.0
SC 10712.0 10712 10712.0 10712.0 12168.0 13624.0 15080.0 15080.0 15080.0 15080.0 15080.0
SD 10712.0 10712 10712.0 10712.0 12168.0 13624.0 15080.0 15080.0 15080.0 15080.0 15080.0
TN 10712.0 10712 10712.0 10712.0 12168.0 13624.0 15080.0 15080.0 15080.0 15080.0 15080.0
TX 10712.0 10712 10712.0 10712.0 12168.0 13624.0 15080.0 15080.0 15080.0 15080.0 15080.0
UT 10712.0 10712 10712.0 10712.0 12168.0 13624.0 15080.0 15080.0 15080.0 15080.0 15080.0
VT 14040.0 14560 15080.0 15662.4 15974.4 16764.8 16764.8 16952.0 17596.8 17888.0 18158.4
VA 10712.0 10712 10712.0 10712.0 12168.0 13624.0 15080.0 15080.0 15080.0 15080.0 15080.0
WA 14892.8 15288 15870.4 16494.4 16785.6 17784.0 17784.0 18033.6 18803.2 19115.2 19385.6
WV 10712.0 10712 10712.0 12168.0 13624.0 15080.0 15080.0 15080.0 15080.0 15080.0 15080.0
WI 10712.0 10712 11856.0 13520.0 13520.0 13624.0 15080.0 15080.0 15080.0 15080.0 15080.0
WY 10712.0 10712 10712.0 10712.0 12168.0 13624.0 15080.0 15080.0 15080.0 15080.0 15080.0
DC 12792.0 13728 14560.0 14560.0 14560.0 15704.0 17160.0 17160.0 17160.0 17160.0 19760.0

In [236]:
# Create dataframe that combines model values with the appropriate minimum and median wage
wages_df = pd.DataFrame()
wages = [ ]
for year in model_years:
    final_model_values_subdf = final_model_values_df[ final_model_values_df['year'] == year]
    saipe_county_data[year].index = saipe_county_data[year]['fips']
    df = saipe_county_data[year].join(final_model_values_subdf, lsuffix='l', how='inner')[["pop", "state_alpha", "countyname", "median_income", "total_cost", "year"]]
    
    df = df.join(minimum_wage_df[year], on='state_alpha')
    df.rename(columns={year: "minimum_wage"}, inplace=True)
    
    wages.append(df)
wages_df = pd.concat(wages)
wages_df = wages_df.reset_index()
wages_df = wages_df.set_index(['fips'])

# wages_df = wages_df.reset_index()

# Calculate gaps
wages_df['inflation_adjust'] = wages_df['year'].map(lambda year: updated_inflation_multipliers[year])
wages_df['gap_median'] = wages_df['median_income'] * wages_df['inflation_adjust'] - wages_df['total_cost']
wages_df['gap_minimum'] = wages_df['minimum_wage'] * wages_df['inflation_adjust'] - wages_df['total_cost']

# Gap assuming a hypothetical $15/hour wage (15 * 40 * 52 = $31,200/year)
wages_df['gap_minimum15'] = 15.0 * 40 * 52 * wages_df['inflation_adjust'] - wages_df['total_cost']

wages_df


Out[236]:
pop state_alpha countyname median_income total_cost year minimum_wage inflation_adjust gap_median gap_minimum gap_minimum15
fips
100199999 43671 AL Autauga County 45379 20477.840731 2004 10712 1.259068 36657.419655 -6990.701101 18805.090229
100399999 140415 AL Baldwin County 42227 20491.407931 2004 10712 1.259068 32675.269173 -7004.268301 18791.523029
100599999 29038 AL Barbour County 26495 18537.731131 2004 10712 1.259068 14821.283478 -5050.591501 20745.199829
100799999 20826 AL Bibb County 34055 18537.731131 2004 10712 1.259068 24339.839826 -5050.591501 20745.199829
100999999 51024 AL Blount County 38643 20844.155131 2004 10712 1.259068 27810.021186 -7357.015501 18438.775829
101199999 11714 AL Bullock County 20485 18537.731131 2004 10712 1.259068 7254.282995 -5050.591501 20745.199829
101399999 21399 AL Butler County 27244 18537.731131 2004 10712 1.259068 15764.325634 -5050.591501 20745.199829
101599999 112249 AL Calhoun County 34813 18795.507931 2004 10712 1.259068 25036.436797 -5308.368301 20487.423029
101799999 36583 AL Chambers County 30573 18537.731131 2004 10712 1.259068 19955.764005 -5050.591501 20745.199829
101999999 23988 AL Cherokee County 33564 18537.731131 2004 10712 1.259068 23721.637290 -5050.591501 20745.199829
102199999 39593 AL Chilton County 35507 18537.731131 2004 10712 1.259068 26168.006997 -5050.591501 20745.199829
102399999 15922 AL Choctaw County 27690 18537.731131 2004 10712 1.259068 16325.870096 -5050.591501 20745.199829
102599999 27867 AL Clarke County 28776 18537.731131 2004 10712 1.259068 17693.218270 -5050.591501 20745.199829
102799999 14254 AL Clay County 31194 18537.731131 2004 10712 1.259068 20737.645419 -5050.591501 20745.199829
102999999 14123 AL Cleburne County 34128 18537.731131 2004 10712 1.259068 24431.751812 -5050.591501 20745.199829
103199999 43615 AL Coffee County 37066 19297.494331 2004 10712 1.259068 27371.131277 -5810.354701 19985.436629
103399999 54984 AL Colbert County 34575 19216.091131 2004 10712 1.259068 24316.195342 -5728.951501 20066.839829
103599999 14089 AL Conecuh County 24334 18537.731131 2004 10712 1.259068 12100.436881 -5050.591501 20745.199829
103799999 12202 AL Coosa County 30173 18537.731131 2004 10712 1.259068 19452.136685 -5050.591501 20745.199829
103999999 37631 AL Covington County 29237 18537.731131 2004 10712 1.259068 18273.648756 -5050.591501 20745.199829
104199999 13665 AL Crenshaw County 28909 18537.731131 2004 10712 1.259068 17860.674354 -5050.591501 20745.199829
104399999 77483 AL Cullman County 35150 18537.731131 2004 10712 1.259068 25718.519614 -5050.591501 20745.199829
104599999 49129 AL Dale County 34065 18971.881531 2004 10712 1.259068 23918.280109 -5484.741901 20311.049429
104799999 46365 AL Dallas County 24949 18537.731131 2004 10712 1.259068 12874.763886 -5050.591501 20745.199829
104999999 64452 AL DeKalb County 32286 18537.731131 2004 10712 1.259068 22112.548003 -5050.591501 20745.199829
105199999 65874 AL Elmore County 43645 20477.840731 2004 10712 1.259068 34474.195223 -6990.701101 18805.090229
105399999 38440 AL Escambia County 29330 18537.731131 2004 10712 1.259068 18390.742108 -5050.591501 20745.199829
105599999 103459 AL Etowah County 32727 18944.747131 2004 10712 1.259068 22260.781123 -5457.607501 20338.183829
105799999 18495 AL Fayette County 30595 18537.731131 2004 10712 1.259068 19983.463508 -5050.591501 20745.199829
105999999 31223 AL Franklin County 29234 18537.731131 2004 10712 1.259068 18269.871551 -5050.591501 20745.199829
... ... ... ... ... ... ... ... ... ... ... ...
5512999999 15911 WI Washburn County 42079 22001.262110 2014 15080 1.004655 20273.615635 -6851.064710 9343.973890
5513199999 131887 WI Washington County 69346 23520.632606 2014 15080 1.004655 46148.173024 -8370.435206 7824.603394
5513399999 389891 WI Waukesha County 76584 23520.632606 2014 15080 1.004655 53419.865914 -8370.435206 7824.603394
5513599999 52410 WI Waupaca County 52850 21363.669134 2014 15080 1.004655 31732.347616 -6213.471734 9981.566866
5513799999 24496 WI Waushara County 46835 21228.011054 2014 15080 1.004655 25825.005871 -6077.813654 10117.224946
5513999999 166994 WI Winnebago County 52711 21580.722062 2014 15080 1.004655 31375.647643 -6430.524662 9764.513938
5514199999 74749 WI Wood County 51003 21431.498174 2014 15080 1.004655 29808.920791 -6281.300774 9913.737826
5600199999 36299 WY Albany County 42838 23042.192310 2014 15080 1.004655 19995.218580 -7891.994910 8303.043690
5600399999 11668 WY Big Horn County 50437 21591.400710 2014 15080 1.004655 29080.383525 -6441.203310 9753.835290
5600599999 46133 WY Campbell County 79358 24831.953910 2014 15080 1.004655 54895.457580 -9681.756510 6513.282090
5600799999 15885 WY Carbon County 58225 22038.841110 2014 15080 1.004655 36457.196265 -6888.643710 9306.394890
5600999999 13833 WY Converse County 63166 21604.959510 2014 15080 1.004655 41855.078220 -6454.762110 9740.276490
5601199999 7083 WY Crook County 59701 22486.281510 2014 15080 1.004655 37492.626645 -7336.084110 8858.954490
5601399999 40123 WY Fremont County 48624 22445.605110 2014 15080 1.004655 26404.739610 -7295.407710 8899.630890
5601599999 13249 WY Goshen County 45721 21618.518310 2014 15080 1.004655 24315.312945 -6468.320910 9726.717690
5601799999 4812 WY Hot Springs County 44427 22486.281510 2014 15080 1.004655 22147.526175 -7336.084110 8858.954490
5601999999 8569 WY Johnson County 51434 22269.340710 2014 15080 1.004655 29404.084560 -7119.143310 9075.895290
5602199999 91738 WY Laramie County 57192 23109.986310 2014 15080 1.004655 34348.242450 -7959.788910 8235.249690
5602399999 18106 WY Lincoln County 63575 23733.691110 2014 15080 1.004655 40137.250515 -8583.493710 7611.544890
5602599999 75450 WY Natrona County 57427 23015.074710 2014 15080 1.004655 34679.247975 -7864.877310 8330.161290
5602799999 2484 WY Niobrara County 45251 22499.840310 2014 15080 1.004655 22961.803095 -7349.642910 8845.395690
5602999999 28205 WY Park County 55354 22310.017110 2014 15080 1.004655 33301.655760 -7159.819710 9035.218890
5603199999 8667 WY Platte County 49713 22486.281510 2014 15080 1.004655 27458.132505 -7336.084110 8858.954490
5603399999 29116 WY Sheridan County 54716 23964.190710 2014 15080 1.004655 31006.512270 -8813.993310 7381.045290
5603599999 10247 WY Sublette County 77222 26201.392710 2014 15080 1.004655 51380.075700 -11051.195310 5143.843290
5603799999 43806 WY Sweetwater County 72604 24316.719510 2014 15080 1.004655 48625.252110 -9166.522110 7028.516490
5603999999 21294 WY Teton County 75348 26350.539510 2014 15080 1.004655 49348.205430 -11200.342110 4994.696490
5604199999 21118 WY Uinta County 56800 21835.459110 2014 15080 1.004655 35228.944890 -6685.261710 9509.776890
5604399999 8533 WY Washakie County 50802 22486.281510 2014 15080 1.004655 28552.201800 -7336.084110 8858.954490
5604599999 7208 WY Weston County 55520 21916.811910 2014 15080 1.004655 33861.633690 -6766.614510 9428.424090

34633 rows × 11 columns
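As a quick sanity check on the gap columns (an illustrative snippet, not one of the notebook cells), the first row above (Autauga County, AL, 2004) can be reproduced directly from the formulas in In [236]; the tiny differences in the last decimals come from the rounded inflation multiplier shown in the table:

median_income, minimum_wage, total_cost, inflation = 45379, 10712, 20477.840731, 1.259068
print(median_income * inflation - total_cost)   # ~36657.4  (gap_median)
print(minimum_wage * inflation - total_cost)    # ~-6990.7  (gap_minimum)
print(15.0 * 40 * 52 * inflation - total_cost)  # ~18805.1  (gap_minimum15: hypothetical $15/hour wage)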


In [179]:
# Serialize for later use
# wages_df.to_csv('wages_df.v3.csv')

# De-serialize
wages_df = pd.DataFrame.from_csv('wages_df.v3.csv')

Analysis


In [237]:
f, axes = plt.subplots(1, 1, sharey=True, figsize=(16, 16))
# axes = [ax11, ax21, ax31, ax41, ax51, ax61, ax71, ax81, ax91, ax12, ax22, ax32, ax42, ax52, ax62, ax72, ax82, ax92]

setup_custom_visuals()

for idx, year in enumerate(model_years):
    # subdf = wages_df.loc[year]
    subdf = wages_df[ wages_df["year"] == year]

    vpstats1 = custom_violin_stats(np.asarray(subdf['gap_median']), np.asarray(subdf['pop']))
    tmp = axes.violin(vpstats1, [idx], vert=False, showmeans=True, showextrema=True, showmedians=True)
    for pc in tmp['bodies']:
        pc.set_facecolor(current_color_palette[0])
        pc.set_edgecolor('black')
            
#     tmp = axes.violinplot([subdf['gap_median']], [idx], vert=False, showmeans=True, showmedians=True)
#     for pc in tmp['bodies']:
#         pc.set_facecolor(current_color_palette[1])
#         pc.set_edgecolor('black')

axes.set_ylim((-0.5, len(model_years)))
axes.set_title("Weighted Distribution of the Gap Between \n County Median Wage and Living Wage (2004-2014)")
axes.set_ylabel("Year")
axes.set_xlabel("Living Wage Gap (Dollars)")
l = axes.set_yticklabels(['']+ [year for year in model_years if year % 2 == 0])
# plt.savefig("LaTeX/CurrentThesis/figures/median_gap_years.png")
plt.show()



In [68]:
f, axes = plt.subplots(1, 1, sharey=True, figsize=(16, 16))

for idx, year in enumerate(model_years):
    # subdf = wages_df.loc[year]
    subdf = wages_df[ wages_df["year"] == year ]

    subdf_gap_minimum = subdf['gap_minimum'][ np.invert(np.isnan(subdf['gap_minimum'])) ]
    subdf_gap_minimum15 = subdf['gap_minimum15'][ np.invert(np.isnan(subdf['gap_minimum'])) ]
    subdf_pop = subdf['pop'][ np.invert(np.isnan(subdf['gap_minimum'])) ]

    # vpstats1 = custom_violin_stats(np.asarray(subdf_gap_minimum), np.asarray(subdf_pop))
    # tmp = axes.violin(vpstats1, [0 + 0.6*idx], vert=False, showmeans=True, showextrema=True, showmedians=True)
    tmp = axes.violinplot([subdf_gap_minimum], [0 + 1.0*idx], vert=False, showextrema=True, showmeans=True, showmedians=True)
    for pc in tmp['bodies']:
        pc.set_facecolor(current_color_palette[0])
        pc.set_edgecolor('black')
    
    # vpstats1 = custom_violin_stats(np.asarray(subdf_gap_minimum15), np.asarray(subdf_pop))
    # tmp = axes.violin(vpstats1, [0 + 0.6*idx], vert=False, showmeans=True, showextrema=True, showmedians=True)
    tmp = axes.violinplot([subdf_gap_minimum15], [0 + 1.0*idx], vert=False, showextrema=True, showmeans=True, showmedians=True)
    for pc in tmp['bodies']:
        pc.set_facecolor(current_color_palette[1])
        pc.set_edgecolor('black')

axes.plot([0, 0], [0, len(model_years)], color='k', linestyle='--', linewidth=0.5)  # dashed reference line at a gap of zero

axes.set_title("Distribution of the Gap Between \n County Minimum Wage and Living Wage (2004-2014)")
axes.set_xlim((-18000, 22000))
axes.set_ylim((-0.5, len(model_years)))
axes.set_ylabel("Year")
axes.set_xlabel("Lving Wage Gap (Dollars)")
l = axes.set_yticklabels(['']+ [year for year in model_years if year % 2 == 0])
# plt.savefig("LaTeX/CurrentThesis/figures/minimum_gap_years.png")



In [238]:
state_averages_for_minimum_gap = None
for idx, year in enumerate(model_years):
    # subdf = wages_df.loc[year]
    subdf = wages_df[ wages_df["year"] == year ]
    results = subdf[["state_alpha", "gap_minimum"]].groupby("state_alpha").mean()
    results = results.rename(columns={"gap_minimum": year})
    if state_averages_for_minimum_gap is not None:
        state_averages_for_minimum_gap = state_averages_for_minimum_gap.join(results)
    else:
        state_averages_for_minimum_gap = results

state_averages_for_minimum_gap = state_averages_for_minimum_gap.dropna()
state_averages_for_minimum_gap = state_averages_for_minimum_gap.transpose()
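Note that state_averages_for_minimum_gap takes a simple (unweighted) mean of the county-level gaps within each state. If a population-weighted state average were wanted instead, a sketch (assuming the same per-year subdf slices as in the loop above) could look like:

import numpy as np

# Population-weighted mean of the minimum-wage gap per state for one year's slice
weighted_state_gap = subdf.dropna(subset=["gap_minimum"]).groupby("state_alpha").apply(
    lambda g: np.average(g["gap_minimum"], weights=g["pop"]))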

In [296]:
f = plt.figure(figsize=(16, 15))

# x_vals = [2014.20, 2014.40, 2014.60, 2014.80, 2015, 2015.20]
x_vals = [2014.25, 2014.65, 2014.95, 2015.25]
# x_vals = [2014.50, 2015, 2015.50]
label_coords = []

for idx, state in enumerate(state_averages_for_minimum_gap):
    try:
        # Color by group
        color = 'blue'
        if state in states_above_28000:
            color = 'red'
        elif state in states_below_24000:
            color = 'orange'
        
        # Style by group
        linestyle = "--"
        alpha=0.7
        if state in ["MD", "DC", "RI"]:
            linestyle='solid'
            alpha=None
            
        # Plot the state average trend line
        ax = state_averages_for_minimum_gap[state].plot(label=state, alpha=alpha,linestyle=linestyle, marker='o', color=color)
        
        # Store coords of corresponding label
        y = state_averages_for_minimum_gap[state].loc[2014]
        label_coords.append((state,y))
    except:
        continue

# Sort
sorted_label_coords = sorted(label_coords, key=lambda x: x[1])

# Print labels
for idx, coords in enumerate(sorted_label_coords):
    state, y = coords
    x = x_vals[idx % len(x_vals)]
    t = ax.text(x, y, state, ha='center', va='bottom', fontsize=9)
    
f.axes[0].set_title("Gap Between the Minimum and Living Wage By State Per Year (2004 - 2014)")
f.axes[0].set_xlim((2004, 2016))
f.axes[0].set_ylabel("Minimum Wage Gap (Dollars / Year)")

# Legends
import matplotlib.patches as mpatches
red_patch = mpatches.Patch(color='red', label='> 28,000')
blue_patch = mpatches.Patch(color='blue', label='>14,000, <28,000')
orange_patch = mpatches.Patch(color='orange', label='< 14,000')
plt.legend(handles=[red_patch, blue_patch, orange_patch], fontsize=12, loc="lower right")

f.axes[0].text(2014, -15000, "State Avg Living Wage is:", fontsize=12)

plt.savefig("LaTeX/CurrentThesis/figures/state_gap_years.png")
plt.show()


Wage Distribution

Non-family household income data is used here to match our model definition, since this model is for a single adult with no children.


In [15]:
wage_dist_2014_df = pd.DataFrame.from_csv(open("data/wage_distribution/ACS_14_5YR_S1901/ACS_14_5YR_S1901_with_ann_stripped.csv", "r"), index_col=False)
wage_dist_2014_df = wage_dist_2014_df[["GEO.id2", "HC04_EST_VC01", "HC04_EST_VC02", "HC04_EST_VC03", "HC04_EST_VC04", "HC04_EST_VC05"]]
wage_dist_columns = ["fips", "nf_hh_total", 
                            "nf_hh_perc_<10000", 
                            "nf_hh_perc_<14999", 
                            "nf_hh_perc_<24999", 
                            "nf_hh_perc_<34999", ]
wage_dist_2014_df.columns = wage_dist_columns

wage_dist_endpoints = [10000, 14999, 24999, 34999]
wage_dist_endpoint_colidx = [ 
    ["nf_hh_perc_<10000"],
    ["nf_hh_perc_<10000", "nf_hh_perc_<14999"],
    ["nf_hh_perc_<10000", "nf_hh_perc_<14999", "nf_hh_perc_<24999"],
    ["nf_hh_perc_<10000", "nf_hh_perc_<14999", "nf_hh_perc_<24999", "nf_hh_perc_<34999"],
]

wage_dist_2014_df["fips"] = wage_dist_2014_df["fips"].apply(lambda x: int(str(x) + '99999'))
wage_dist_2014_df = wage_dist_2014_df.set_index("fips")
wage_dist_columns = wage_dist_columns[1:]
final_costs_2014 = final_model_values_df[ final_model_values_df["year"] == 2014 ]

# print set(final_costs_2014.index) - set(wage_dist_2014_df["fips"])  # 4 counties

wage_dist_2014_final_df = final_costs_2014.join(wage_dist_2014_df)[["total_cost"] + wage_dist_columns]

In [16]:
def calc_perc_below(row):
    tc = row["total_cost"]
    
    idx = 0 
    while idx < len(wage_dist_endpoints):
        endpoint = wage_dist_endpoints[idx]
        if tc < endpoint:
            break
        idx +=1
    try:
        return np.sum(row[wage_dist_endpoint_colidx[idx]])
    except Exception as e:
        return 0.0

    

def calc_perc_below_underest(row):
    tc = row["total_cost"]
    
    idx = 0 
    while idx < len(wage_dist_endpoints):
        endpoint = wage_dist_endpoints[idx]
        if tc < endpoint:
            if idx != 0:
                idx -= 1
            break
        idx +=1
    try:
        return np.sum(row[wage_dist_endpoint_colidx[idx]])
    except Exception as e:
        return 0.0
    
# For each county, calc % of single households earning the living wage or less
wage_dist_2014_final_df['perc_less_living_wage'] = wage_dist_2014_final_df.apply(calc_perc_below, axis=1)
wage_dist_2014_final_df['perc_less_living_wage_underest'] = wage_dist_2014_final_df.apply(calc_perc_below_underest, axis=1)

# Get percentage but filter out nulls
subdf = wage_dist_2014_final_df['perc_less_living_wage'][ np.invert(np.isnan(wage_dist_2014_final_df['perc_less_living_wage'])) ] / 100.0
subdf2 = wage_dist_2014_final_df['perc_less_living_wage_underest'][ np.invert(np.isnan(wage_dist_2014_final_df['perc_less_living_wage_underest'])) ] / 100.0
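To make the over/under-estimate explicit: a county's living wage generally falls between two of the ACS bracket endpoints, so calc_perc_below sums the bracket shares up to the endpoint just above the living wage (an overestimate of the share earning less than it), while calc_perc_below_underest stops one bracket earlier (an underestimate). A small illustration with made-up numbers, not part of the notebook cells:

import pandas as pd

# Hypothetical county: living wage of $21,000/year; bracket shares are invented for illustration
example = pd.Series({"total_cost": 21000.0,
                     "nf_hh_perc_<10000": 8.0,
                     "nf_hh_perc_<14999": 6.0,
                     "nf_hh_perc_<24999": 12.0,
                     "nf_hh_perc_<34999": 9.0})
print(calc_perc_below(example))           # 26.0 -> share below $24,999 (overestimate)
print(calc_perc_below_underest(example))  # 14.0 -> share below $14,999 (underestimate)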

In [20]:
f, axes = plt.subplots(1, 1, sharey=True, figsize=(16, 16))
tmp = axes.hist([subdf], bins=50, alpha=0.5, normed=True, color=current_color_palette[0], label="Overestimate")
tmp = axes.hist([subdf2], bins=50, alpha=0.5, normed=True, color=current_color_palette[1], label="Underestimate")

# f, axes = plt.subplots(1, 1, sharey=True, figsize=(10, 10))

# vpstats1 = custom_violin_stats(np.asarray(subdf_gap_minimum), np.asarray(subdf_pop))
# tmp = axes.violin(vpstats1, [0 + 0.6*idx], vert=False, showmeans=True, showextrema=True, showmedians=True)
tmp = axes.violinplot([subdf], [0], vert=False, showextrema=True, showmeans=True, showmedians=True)
for pc in tmp['bodies']:
    pc.set_facecolor(current_color_palette[0])
    pc.set_edgecolor('black')
    
tmp = axes.violinplot([subdf2], [0], vert=False, showextrema=True, showmeans=True, showmedians=True)
for pc in tmp['bodies']:
    pc.set_facecolor(current_color_palette[1])
    pc.set_edgecolor('black')

axes.set_title("Estimated Distribution of Percentage of Single Households Earning \nthe Living Wage or Less Across Counties")
axes.set_xlabel("Percentage of Single Households Earning the Living Wage or Less")
axes.legend()
# axes.set_xlim(-0.1, 1)
plt.savefig("LaTeX/CurrentThesis/figures/perc_below_living.png")
plt.show()



In [184]:
# setup_custom_visuals()
# import collections
# counter = collections.Counter( 
#     [ val for val in wage_dist_2014_final_df[ wage_dist_2014_final_df['perc_less_living_wage_underest'] > 30.0 ]\
#         .join(fmr_df.loc[2014], lsuffix='l')['region'].values if val != 'base'] )

# fig, ax = plt.subplots(figsize=(16, 16))
# # ax = fig.gca()
# patches, texts, autotexts = ax.pie(counter.values(), autopct='%1.1f%%', textprops={"fontsize": "larger"}, labels=counter.keys(), startangle=90, radius=0.45, center=(0.5, 0.5), frame=True)
# for text in texts:
#     text.set_fontsize("medium")
# texts[-1].set_x(0.6)
# texts[-2].set_x(0.2)
# texts[-3].set_x(0.2)
# texts[-3].set_y(0.75)
# texts[-4].set_y(0.85)
# for patch in patches:
#     patch.set_alpha(0.5)
# plt.setp(ax.get_xticklabels(), visible=False)
# plt.setp(ax.get_yticklabels(), visible=False)
# l = ax.set_title("Regional Breakdown of Counties that have 50% or Greater \n Single Households that make the Living Wage or Less")
# plt.savefig("LaTeX/CurrentThesis/figures/regional_below_living_underest.png")

In [185]:
setup_custom_visuals()
import collections
counter = collections.Counter( 
    [ val for val in wage_dist_2014_final_df[ wage_dist_2014_final_df['perc_less_living_wage'] > 50.0 ]\
        .join(fmr_df.loc[2014], lsuffix='l')['region'].values if val != 'base'] )

fig, ax = plt.subplots(figsize=(16, 16))
# ax = fig.gca()
patches, texts, autotexts = ax.pie(counter.values(), autopct='%1.1f%%', textprops={"fontsize": "larger"}, labels=counter.keys(), startangle=90, radius=0.45, center=(0.5, 0.5), frame=True)
for text in texts:
    text.set_fontsize("medium")
texts[-1].set_x(0.6)
texts[-2].set_x(0.2)
texts[-3].set_x(0.2)
texts[-3].set_y(0.75)
texts[-4].set_y(0.85)
for patch in patches:
    patch.set_alpha(0.5)
plt.setp(ax.get_xticklabels(), visible=False)
plt.setp(ax.get_yticklabels(), visible=False)
l = ax.set_title("Regional Breakdown of Counties that have 50% or Greater \n Single Households that make the Living Wage or Less")
plt.savefig("LaTeX/CurrentThesis/figures/regional_below_living.png")



In [99]:
# Export county-level percentages for the map below. Rebuild the series explicitly so this
# cell does not depend on `subdf`, which gets overwritten by the plotting cells above.
perc_below = wage_dist_2014_final_df['perc_less_living_wage'].dropna() / 100.0
map_df = perc_below.to_frame().reset_index().rename(columns={"fips": "id", "perc_less_living_wage": "rate"})
map_df["id"] = map_df["id"].apply(lambda x: int(x / 100000.0))
map_df.to_csv("data/map/perc_at_or_below_living_wage.tsv", sep="\t", index=False)

In [100]:
%%html
<div id='map2-legend'></div>
<div id='map2'></div>



In [101]:
print_map("/notebooks/Thesis/data/map/perc_at_or_below_living_wage.tsv", "map2", "map2_div1", domain_min=0.1, domain_max=1, show_legend=True)


Out[101]:

Overall Results

  • The living wage seems to have leveled off in the last few years. The cause is not clear and would be an interesting subject for future analysis.
  • The gap between the minimum wage and the living wage reached a peak, and then some gains were made due to increases in the minimum wage. However, due to rent and inflation, current levels are close to the peak again.
  • The top 150 most populous counties have a much higher living wage than the rest of the country. This is mostly due to rent.
  • The top 150 most populous counties can be split into two groups.
  • Whites and blacks seem to have similar living wage distributions, as they are relatively well mixed across the country. Other races, especially Asians and Pacific Islanders, are concentrated in areas with high living wages.
  • Races seem to experience the same increases over time, with their distributions being controlled by population dynamics (since the living wage changes most with location).
  • When looking at only single households, we see that, on average, 50% of them are making the living wage or less.
  • The region with the most counties where 50% or more of single households make the living wage or less is the South.

  • Median wage?

Future

  • If we could get wages broken down by race and by county, this would allow us to see how the living wage gap has evolved over time between races.


Appendix - Data Tables

(TOC)

Housing Costs Data Table

The model definition is specified in Model Variable: Housing.


In [272]:
fmr_df


Out[272]:
county countyname cousub fips fmr0 fmr0_inf pop pop100 pop2000 pop2010 region state state_alpha
year fips
2002 202099999 NaN Anchorage, AK MSA NaN 202099999 519 686.147427 NaN NaN NaN NaN base NaN AK
201399999 NaN ALEUTIANS EAST NaN 201399999 538 711.266505 NaN NaN NaN NaN base NaN AK
201699999 NaN ALEUTIANS WEST NaN 201699999 461 609.468139 NaN NaN NaN NaN base NaN AK
205099999 NaN BETHEL NaN 205099999 695 918.829407 NaN NaN NaN NaN base NaN AK
206099999 NaN BRISTOL BAY NaN 206099999 558 737.707639 NaN NaN NaN NaN base NaN AK
207099999 NaN DILLINGHAM NaN 207099999 671 887.100046 NaN NaN NaN NaN base NaN AK
209099999 NaN FAIRBANKS-NORTH STAR NaN 209099999 423 559.229984 NaN NaN NaN NaN base NaN AK
210099999 NaN HAINES BOROUGH NaN 210099999 501 662.350407 NaN NaN NaN NaN base NaN AK
211099999 NaN JUNEAU NaN 211099999 748 988.898412 NaN NaN NaN NaN base NaN AK
212299999 NaN KENAI PENINSULA NaN 212299999 455 601.535799 NaN NaN NaN NaN base NaN AK
213099999 NaN KETCHIKAN-GATEWAY NaN 213099999 549 725.809128 NaN NaN NaN NaN base NaN AK
215099999 NaN KODIAK ISLAND NaN 215099999 715 945.270541 NaN NaN NaN NaN base NaN AK
216499999 NaN LAKE & PENINSULA NaN 216499999 429 567.162324 NaN NaN NaN NaN base NaN AK
217099999 NaN MATANUSKA-SUSITNA NaN 217099999 481 635.909273 NaN NaN NaN NaN base NaN AK
218099999 NaN NOME NaN 218099999 707 934.694087 NaN NaN NaN NaN base NaN AK
218599999 NaN NORTH SLOPE NaN 218599999 804 1062.933587 NaN NaN NaN NaN base NaN AK
202099999 NaN NW ARCTIC NaN 202099999 851 1125.070252 NaN NaN NaN NaN base NaN AK
220199999 NaN PRINCE OF WALES-OUTER KETCHIKAN NaN 220199999 375 495.771263 NaN NaN NaN NaN base NaN AK
222099999 NaN SITKA BORO NaN 222099999 591 781.335510 NaN NaN NaN NaN base NaN AK
223299999 NaN SKAGWAY-YAKUTAT-ANGOON NaN 223299999 459 606.824025 NaN NaN NaN NaN base NaN AK
224099999 NaN SOUTHEAST FAIRBANKS NaN 224099999 471 622.688706 NaN NaN NaN NaN base NaN AK
226199999 NaN VALDEZ-CORDOVA NaN 226199999 562 742.995865 NaN NaN NaN NaN base NaN AK
227099999 NaN WADE HAMPTON NaN 227099999 402 531.466793 NaN NaN NaN NaN base NaN AK
228099999 NaN WRANGELL-PETERSBURG NaN 228099999 409 540.721190 NaN NaN NaN NaN base NaN AK
229099999 NaN YUKON-KOYUKUK NaN 229099999 536 708.622391 NaN NaN NaN NaN base NaN AK
112599999 NaN Anniston, AL MSA NaN 112599999 271 358.277366 NaN NaN NaN NaN south NaN AL
112599999 NaN Auburn-Opelika, AL MSA NaN 112599999 270 356.955309 NaN NaN NaN NaN south NaN AL
112599999 NaN Birmingham, AL MSA NaN 112599999 414 547.331474 NaN NaN NaN NaN south NaN AL
112599999 NaN Decatur, AL MSA NaN 112599999 360 475.940412 NaN NaN NaN NaN south NaN AL
112599999 NaN Dothan, AL MSA NaN 112599999 326 430.990484 NaN NaN NaN NaN south NaN AL
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2014 7210199999 101 Morovis Municipio 99999 7210199999 455 457.118025 NaN NaN NaN 32610 base 72 PR
7210399999 103 Naguabo Municipio 99999 7210399999 455 457.118025 NaN NaN NaN 26720 base 72 PR
7210599999 105 Naranjito Municipio 99999 7210599999 455 457.118025 NaN NaN NaN 30402 base 72 PR
7210799999 107 Orocovis Municipio 99999 7210799999 321 322.494255 NaN NaN NaN 23423 base 72 PR
7210999999 109 Patillas Municipio 99999 7210999999 310 311.443050 NaN NaN NaN 19277 base 72 PR
7211199999 111 Peñuelas Municipio 99999 7211199999 325 326.512875 NaN NaN NaN 24282 base 72 PR
7211399999 113 Ponce Municipio 99999 7211399999 346 347.610630 NaN NaN NaN 166327 base 72 PR
7211599999 115 Quebradillas Municipio 99999 7211599999 321 322.494255 NaN NaN NaN 25919 base 72 PR
7211799999 117 Rincón Municipio 99999 7211799999 333 334.550115 NaN NaN NaN 15200 base 72 PR
7211999999 119 Río Grande Municipio 99999 7211999999 455 457.118025 NaN NaN NaN 54304 base 72 PR
7212199999 121 Sabana Grande Municipio 99999 7212199999 333 334.550115 NaN NaN NaN 25265 base 72 PR
7212399999 123 Salinas Municipio 99999 7212399999 316 317.470980 NaN NaN NaN 31078 base 72 PR
7212599999 125 San Germán Municipio 99999 7212599999 333 334.550115 NaN NaN NaN 35527 base 72 PR
7212799999 127 San Juan Municipio 99999 7212799999 455 457.118025 NaN NaN NaN 395326 base 72 PR
7212999999 129 San Lorenzo Municipio 99999 7212999999 405 406.885275 NaN NaN NaN 41058 base 72 PR
7213199999 131 San Sebastián Municipio 99999 7213199999 333 334.550115 NaN NaN NaN 42430 base 72 PR
7213399999 133 Santa Isabel Municipio 99999 7213399999 316 317.470980 NaN NaN NaN 23274 base 72 PR
7213599999 135 Toa Alta Municipio 99999 7213599999 455 457.118025 NaN NaN NaN 74066 base 72 PR
7213799999 137 Toa Baja Municipio 99999 7213799999 455 457.118025 NaN NaN NaN 89609 base 72 PR
7213999999 139 Trujillo Alto Municipio 99999 7213999999 455 457.118025 NaN NaN NaN 74842 base 72 PR
7214199999 141 Utuado Municipio 99999 7214199999 316 317.470980 NaN NaN NaN 33149 base 72 PR
7214399999 143 Vega Alta Municipio 99999 7214399999 455 457.118025 NaN NaN NaN 39951 base 72 PR
7214599999 145 Vega Baja Municipio 99999 7214599999 455 457.118025 NaN NaN NaN 59662 base 72 PR
7214799999 147 Vieques Municipio 99999 7214799999 316 317.470980 NaN NaN NaN 9301 base 72 PR
7214999999 149 Villalba Municipio 99999 7214999999 346 347.610630 NaN NaN NaN 26073 base 72 PR
7215199999 151 Yabucoa Municipio 99999 7215199999 455 457.118025 NaN NaN NaN 37941 base 72 PR
7215399999 153 Yauco Municipio 99999 7215399999 325 326.512875 NaN NaN NaN 42043 base 72 PR
7801099999 10 St. Croix 99999 7801099999 588 590.737140 NaN NaN NaN 53234 base 78 VI
7802099999 20 St. John 99999 7802099999 668 671.109540 NaN NaN NaN 4197 base 78 VI
7803099999 30 St. Thomas 99999 7803099999 668 671.109540 NaN NaN NaN 51181 base 78 VI

40890 rows × 13 columns
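A note on the columns: fmr0 is the published zero-bedroom (efficiency) Fair Market Rent for the area, and fmr0_inf appears to be fmr0 scaled by the same inflation multiplier used elsewhere in this notebook (for example, for 2014: 455 × 1.004655 ≈ 457.12, matching the inflation_adjust value in the wages table above).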

Food Costs Data Table

The model definition is specified in Model Variable: Food Costs.


In [179]:
national_monthly_food_cost_per_year_df


Out[179]:
2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
base 187.4400 191.3400 195.5400 205.5600 208.9800 212.700 224.5800 243.900 242.6400 264.848422 269.263723 272.085202 268.955169 270.7800
east 202.4352 206.6472 211.1832 222.0048 225.6984 229.716 242.5464 263.412 262.0512 286.036295 290.804821 293.852018 290.471582 292.4424
midwest 178.0680 181.7730 185.7630 195.2820 198.5310 202.065 213.3510 231.705 230.5080 251.606001 255.800537 258.480942 255.507410 257.2410
south 174.3192 177.9462 181.8522 191.1708 194.3514 197.811 208.8594 226.827 225.6552 246.309032 250.415262 253.039238 250.128307 251.8254
west 208.0584 212.3874 217.0494 228.1716 231.9678 236.097 249.2838 270.729 269.3304 293.981748 298.882732 302.014574 298.540237 300.5658

Insurance Costs Data Table

The model definition is specified in Model Variable: Health Insurance Costs.


In [144]:
insurance_costs_df


Out[144]:
2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
alabama 622 620 636 726 838 891 925.0 959 1025 1193 1195 1279 1410 1362
alaska 449 533 433 535 895 714 764.0 814 842 909 1146 1208 1102 1286
arizona 503 547 560 662 752 803 807.0 811 851 974 1209 1200 1102 1096
arkansas 496 533 644 616 796 699 740.0 781 750 967 1028 1024 978 958
california 369 446 475 554 592 658 699.5 741 795 1145 1032 1035 1116 1129
colorado 499 590 581 677 741 717 857.5 998 971 965 1122 1148 1188 1244
connecticut 629 620 789 773 749 862 927.0 992 1082 1348 1273 1368 1536 1305
delaware 559 495 711 694 905 735 810.0 885 1101 1289 1183 1373 1459 1237
district of columbia 507 NaN 710 634 765 699 845.0 991 906 1180 1235 1133 1198 1197
florida 584 569 750 723 892 860 962.5 1065 969 1172 1202 1213 1440 1394
georgia 560 687 699 716 707 862 917.0 972 963 1054 1314 1160 1247 1203
hawaii 250 257 251 311 302 366 408.5 451 461 476 578 535 441 460
idaho 374 533 540 682 737 565 520.5 476 762 909 936 962 997 1039
illinois 502 615 625 693 846 822 888.0 954 1008 1224 1278 1190 1331 1306
indiana 570 611 732 646 701 833 891.5 950 1070 1231 1098 1201 1160 1347
iowa 646 505 682 653 762 784 770.0 756 855 1016 1142 1234 1224 1353
kansas 548 524 786 887 721 765 786.0 807 976 1011 1048 1340 1106 1072
kentucky 549 669 688 700 731 691 748.5 806 1000 968 1174 1149 1243 1314
louisiana 548 622 633 729 803 755 811.5 868 956 1356 1289 1118 1242 1302
maine 612 684 698 892 792 1072 1063.0 1054 981 1319 1179 1128 1144 1176
maryland 524 670 791 804 896 898 931.0 964 1105 1180 1310 1157 1338 1422
massachusetts 691 708 713 885 918 1011 1060.5 1110 1321 1311 1523 1566 1683 1588
michigan 475 502 538 558 704 682 708.5 735 946 1039 1166 1099 1178 1315
minnesota 499 669 604 759 809 810 850.5 891 994 1118 1151 1258 1260 1217
mississippi 501 547 503 637 648 727 738.0 749 994 1125 1045 1117 1122 1154
missouri 441 496 572 641 665 703 829.5 956 999 1054 1223 1175 1060 1243
montana 548 432 475 582 548 598 590.5 583 768 1140 872 826 902 1024
nebraska 548 678 875 736 776 873 941.5 1010 873 1184 1111 1183 1190 1322
nevada 426 413 474 620 691 551 707.0 863 842 838 1093 1063 1332 1204
new hampshire 548 665 753 944 965 1004 1134.0 1264 1087 1187 1310 1308 1447 1481
new jersey 516 621 611 613 847 902 967.5 1033 1045 1200 1281 1269 1282 1293
new mexico 548 536 593 611 794 726 838.0 950 934 1288 1346 1263 1142 1354
new york 506 648 625 714 781 965 956.0 947 1075 1187 1218 1301 1320 1223
north carolina 594 575 541 674 681 704 765.5 827 998 1012 1124 1033 1088 1151
north dakota 548 533 571 638 721 675 714.5 754 860 974 1045 1010 992 1136
ohio 567 604 579 687 674 781 833.0 885 1065 1040 1193 1276 1077 1260
oklahoma 386 680 625 575 680 650 718.5 787 815 1140 1096 1137 1086 1154
oregon 342 350 438 427 503 547 579.5 612 627 927 925 871 822 914
pennsylvania 435 580 533 661 659 881 866.5 852 917 1042 1127 1102 1098 1141
rhode island 568 533 820 794 840 862 956.0 1050 1207 1253 1470 1385 1433 1459
south carolina 569 517 668 731 776 810 829.5 849 898 1099 1299 1192 1163 1332
south dakota 548 533 771 722 807 718 802.5 887 890 1036 1191 1260 1378 1213
states not shown separately 548 533 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
tennessee 459 564 760 854 800 745 829.5 914 1010 1060 1092 1080 1194 1409
texas 473 530 548 663 617 728 786.0 844 991 1132 1058 1051 1161 1211
united states 498 565 606 671 723 788 835.0 882 957 1116 1155 1160 1197 1234
utah 491 562 638 614 796 826 789.0 752 772 1187 1013 1177 1114 1297
vermont 569 533 653 744 739 738 862.0 986 1008 1201 1293 1289 1197 1281
virginia 580 563 634 735 752 981 984.5 988 1060 1217 1145 1306 1272 1296
washington 303 306 385 427 384 623 596.0 569 640 815 917 910 695 937
west virginia 548 641 538 600 656 825 937.0 1049 1085 1019 1049 1151 1076 1297
wisconsin 544 647 830 795 859 885 977.0 1069 1011 1283 1161 1320 1248 1257
wyoming 548 487 574 645 673 655 686.0 717 729 876 928 1111 1083 1139

Tax Data Tables

The model definition is specified in Model Variable: Taxes.


In [164]:
updated_fica_tax_rate_df


Out[164]:
2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
fica rate 0.0765 0.0765 0.0765 0.0765 0.0765 0.0765 0.0765 0.0765 0.0765 0.0765 0.0765 0.0565 0.0565 0.0765 0.0765

In [162]:
updated_federal_income_tax_rate_df


Out[162]:
2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
federal income tax rate 0.0802 0.0671 0.0656 0.0534 0.0537 0.0569 0.0585 0.0593 0.0354 0.0447 0.045 0.0559 0.0584 0.0579 0.0534

In [566]:
updated_state_tax_rate_df


Out[566]:
2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
AL 0.02000 0.04000 0.04000 0.04000 0.04000 0.04000 0.04000 0.04000 0.04000 0.04000 0.04000 0.04000 0.04000 0.0400
AK 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.0000
AZ 0.02870 0.03200 0.03200 0.03200 0.03200 0.03200 0.03040 0.02880 0.02880 0.02880 0.02880 0.02880 0.02880 0.0288
AR 0.01000 0.02500 0.02500 0.02500 0.02500 0.02500 0.02500 0.02500 0.02500 0.02500 0.02500 0.02500 0.02500 0.0250
CA 0.01000 0.02000 0.02000 0.02000 0.02000 0.02000 0.02000 0.02000 0.02250 0.02250 0.02000 0.02000 0.02000 0.0200
CO 0.04630 0.04630 0.04630 0.04630 0.04630 0.04630 0.04630 0.04630 0.04630 0.04630 0.04630 0.04630 0.04630 0.0463
CT 0.03000 0.04500 0.04500 0.05000 0.05000 0.05000 0.05000 0.05000 0.05000 0.05000 0.05000 0.05000 0.05000 0.0500
DE 0.00000 0.02200 0.02200 0.02200 0.02200 0.02200 0.02200 0.02200 0.02200 0.02200 0.02200 0.02200 0.02200 0.0220
FL 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.0000
GA 0.01000 0.02000 0.02000 0.02000 0.02000 0.02000 0.02000 0.02000 0.02000 0.02000 0.02000 0.02000 0.02000 0.0200
HI 0.01600 0.03700 0.03200 0.03200 0.03200 0.03200 0.03200 0.03200 0.03200 0.03200 0.03200 0.03200 0.03200 0.0320
ID 0.01900 0.03600 0.03600 0.03600 0.03600 0.03600 0.03600 0.03600 0.03600 0.03600 0.03600 0.03600 0.03600 0.0360
IL 0.03000 0.03000 0.03000 0.03000 0.03000 0.03000 0.03000 0.03000 0.03000 0.03000 0.05000 0.05000 0.05000 0.0500
IN 0.03400 0.03400 0.03400 0.03400 0.03400 0.03400 0.03400 0.03400 0.03400 0.03400 0.03400 0.03400 0.03400 0.0340
IA 0.00360 0.00720 0.00720 0.00720 0.00720 0.00720 0.00720 0.00720 0.00720 0.00720 0.00720 0.00720 0.00720 0.0072
KS 0.03500 0.06250 0.06250 0.06250 0.06250 0.06250 0.06250 0.06250 0.06250 0.06250 0.06250 0.06250 0.04900 0.0480
KY 0.02000 0.03000 0.03000 0.03000 0.03000 0.03000 0.03000 0.03000 0.03000 0.03000 0.03000 0.03000 0.03000 0.0300
LA 0.02000 0.04000 0.04000 0.04000 0.04000 0.04000 0.04000 0.04000 0.04000 0.04000 0.04000 0.04000 0.04000 0.0400
MA 0.02000 0.04500 0.04500 0.04500 0.04500 0.04500 0.04500 0.04500 0.04500 0.06500 0.04500 0.04500 0.06500 0.0650
MD 0.02000 0.03000 0.03000 0.03000 0.03000 0.03000 0.03000 0.03000 0.03000 0.03000 0.03000 0.03000 0.03000 0.0300
ME 0.05850 0.05600 0.05600 0.05300 0.05300 0.05300 0.05300 0.05300 0.05300 0.05300 0.05300 0.05250 0.05250 0.0525
MI 0.04200 0.04200 0.04100 0.04000 0.03900 0.03900 0.03900 0.04350 0.04350 0.04350 0.04350 0.04350 0.04250 0.0425
MN 0.05350 0.07050 0.07050 0.07050 0.07050 0.07050 0.07050 0.07050 0.07050 0.07050 0.07050 0.07050 0.07050 0.0705
MO 0.03000 0.04000 0.04000 0.04000 0.04000 0.04000 0.04000 0.04000 0.04000 0.04000 0.04000 0.04000 0.04000 0.0400
MS 0.01500 0.02000 0.02000 0.02000 0.02000 0.02000 0.02000 0.02000 0.02000 0.02000 0.02000 0.02000 0.02000 0.0200
MT 0.02000 0.03000 0.03000 0.03000 0.03000 0.02000 0.02000 0.02000 0.02000 0.02000 0.02000 0.02000 0.02000 0.0200
NE 0.02510 0.03490 0.03490 0.03570 0.03570 0.03570 0.03570 0.03570 0.03570 0.03570 0.03570 0.03570 0.03510 0.0351
NV 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.0000
NH 0.05000 0.05000 0.05000 0.05000 0.05000 0.05000 0.05000 0.05000 0.05000 0.05000 0.05000 0.05000 0.05000 0.0500
NJ 0.01400 0.01750 0.01750 0.01750 0.01750 0.01750 0.01750 0.01750 0.01750 0.01750 0.01750 0.01750 0.01750 0.0175
NM 0.01700 0.03200 0.03200 0.03200 0.03200 0.03200 0.03200 0.03200 0.03200 0.03200 0.03200 0.03200 0.03200 0.0320
NY 0.04000 0.04500 0.04500 0.04500 0.04500 0.04500 0.04500 0.04500 0.04500 0.04500 0.04500 0.04500 0.04500 0.0450
NC 0.06000 0.07000 0.07000 0.07000 0.07000 0.07000 0.07000 0.07000 0.07000 0.07000 0.07000 0.07000 0.07000 0.0580
ND 0.02100 0.02100 0.02100 0.02100 0.02100 0.02100 0.02100 0.02100 0.01840 0.01840 0.01840 0.01510 0.01510 0.0122
OH 0.00691 0.01486 0.01486 0.01486 0.01486 0.01424 0.01361 0.01299 0.01174 0.01174 0.01174 0.01174 0.01174 0.0107
OK 0.00500 0.01000 0.01000 0.01000 0.01000 0.01000 0.01000 0.01000 0.01000 0.01000 0.01000 0.01000 0.01000 0.0100
OR 0.05000 0.07000 0.07000 0.07000 0.07000 0.07000 0.07000 0.07000 0.07000 0.07000 0.07000 0.07000 0.07000 0.0700
PA 0.02800 0.02800 0.02800 0.03070 0.03070 0.03070 0.03070 0.03070 0.03070 0.03070 0.03070 0.03070 0.03070 0.0307
RI 0.03825 0.03750 0.03750 0.03750 0.03750 0.03750 0.03750 0.03750 0.03750 0.03750 0.03750 0.03750 0.03750 0.0375
SC 0.02500 0.03000 0.03000 0.03000 0.03000 0.03000 0.03000 0.03000 0.03000 0.03000 0.03000 0.03000 0.03000 0.0300
SD 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.0000
TN 0.06000 0.06000 0.06000 0.06000 0.06000 0.06000 0.06000 0.06000 0.06000 0.06000 0.06000 0.06000 0.06000 0.0600
TX 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.0000
UT 0.02300 0.03300 0.03300 0.03300 0.03300 0.03300 0.03300 0.05000 0.05000 0.05000 0.05000 0.05000 0.05000 0.0500
VT 0.03600 0.03600 0.03600 0.03600 0.03600 0.03600 0.03600 0.03600 0.03550 0.03550 0.03550 0.03550 0.03550 0.0355
VA 0.02000 0.03000 0.03000 0.03000 0.03000 0.03000 0.03000 0.03000 0.03000 0.03000 0.03000 0.03000 0.03000 0.0300
WA 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.0000
WV 0.03000 0.04000 0.04000 0.04000 0.04000 0.04000 0.04000 0.04000 0.04000 0.04000 0.04000 0.04000 0.04000 0.0400
WI 0.04730 0.06150 0.06150 0.06150 0.06150 0.06150 0.06150 0.06150 0.06150 0.06150 0.06150 0.06150 0.06150 0.0584
WY 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.0000
DC 0.06000 0.07500 0.07000 0.07500 0.07500 0.07500 0.07000 0.06000 0.06000 0.06000 0.06000 0.06000 0.06000 0.0600

Transportation Costs Data Tables

The model definition is specified in Model Variable: Transportation Costs.


In [456]:
transportation_costs_df


Out[456]:
2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
transportation_costs 4211.621619 4060.070682 3580.393387 3924.435905 4022.279086 4249.598878 4278.041226 4215.876585 3715.3704 3801.845418 3957.491897 4055.284079 4257.053626 3965.869168

In [652]:
l = transportation_costs_df.transpose().plot(linestyle='--', marker='o')


Health Care Costs Data Table

The model definition is specified in Model Variable: Health Care Costs.


In [526]:
health_costs_df


Out[526]:
2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
health_costs 1932.71496 2012.318321 2006.776085 2136.603659 2128.872392 2158.14434 2046.496085 2003.91239 2228.358321 2209.464019 2229.735367 2419.270384 2417.518513 2528.386391

In [654]:
l = health_costs_df.transpose().plot(linestyle='--', marker='o')


Other Necessities Cost Data Table

The model definition is specified in Model Variable: Other Necessities Cost


In [528]:
other_costs_df


Out[528]:
2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
other_costs 2672.458026 2873.691516 2492.637841 2800.961068 2763.890813 2790.161546 2595.236769 2522.705462 2561.350255 2536.679344 2509.044763 2325.19951 2159.887757 2195.290565

In [656]:
l = other_costs_df.transpose().plot(linestyle='--', marker='o')


Appendix - Things to Revisit

(TOC)

  • Go back and redo CEX data to do proper regional weighting
  • Make sure all inflation adjustments are done and done correctly
  • Make sure understanding of CEX survey is solid and assignment of year is correct
  • Education is not taken into account, which makes sense, but an expanded model of a 'middle-class wage' might also be instructive
  • Figure out the child care costs for 2004-2005 and expand the model to family types other than a single adult